Abstract



Concentration inequalities have been the subject of exciting developments during the last two decades, 
and they have been intensively studied and used as a powerful tool in various areas. These include convex 
geometry, functional analysis, statistical physics, statistics, pure and applied probability theory (e.g., 
concentration of measure phenomena in random graphs, random matrices and percolation), information 
theory, learning theory, dynamical systems and randomized algorithms. 

This tutorial article is focused on some of the key modern mathematical tools that are used for 
the derivation of concentration inequalities, on their links to information theory, and on their various 
applications to communications and coding. 

The first part of this article introduces some classical concentration inequalities for martingales, and 
it also derives some recent refinements of these inequalities. The power and versatility of the martingale 
approach is exemplified in the context of binary hypothesis testing, codes defined on graphs and iterative 
decoding algorithms, and some other aspects that are related to wireless communications and coding. 

The second part of this article introduces the entropy method for deriving concentration inequalities 
for functions of many independent random variables, and it also exhibits its multiple connections to 
information theory. The basic ingredients of the entropy method are discussed first in conjunction with 
the closely related topic of logarithmic Sobolev inequalities, which are typical of the so-called functional 
approach to studying concentration of measure phenomena. The discussion on logarithmic Sobolev 
inequalities is complemented by a related viewpoint based on probability in metric spaces. This viewpoint 
centers around the so-called transportation-cost inequalities, whose roots are in information theory. Some 
representative results on concentration for dependent random variables are briefly summarized, with 
emphasis on their connections to the entropy method. Finally, the tutorial addresses several applications 
of the entropy method and related information-theoretic tools to problems in communications and coding. 
These include strong converses for several source and channel coding problems, empirical distributions 
of good channel codes with non-vanishing error probability, and an information-theoretic converse for 
concentration of measure. 
Chapter 1

Introduction 



Concentration of measure inequalities provide bounds on the probability that a random variable X 
deviates from its expected value, median or other typical value by a given quantity. These inequalities 
have been studied for several decades, with some fundamental and substantial contributions to their 
study during the last two decades. Very roughly speaking, the concentration of measure phenomenon 
can be stated in the following simple way: "A random variable that depends in a smooth way on many 
independent random variables (but not too much on any of them) is essentially constant" [1]. The exact
meaning of such a statement clearly needs to be clarified rigorously, but it often means that such a
random variable $X$ concentrates around $\bar{x}$ in a way that the probability of the event $\{|X - \bar{x}| \ge t\}$ (for
some $t > 0$) decays exponentially in $t$. Detailed treatments of the concentration of measure phenomenon,
including historical accounts, can be found, e.g., in [2], [3], [4], [5], [6] and [7].

In recent years, concentration inequalities have been intensively studied and used as a powerful tool in 
various areas. These include convex geometry, functional analysis, statistical physics, statistics, dynamical 
systems, pure and applied probability (random matrices, Markov processes, random graphs, percolation), 
information theory, coding theory, learning theory and randomized algorithms. Several techniques have 
been developed so far to prove concentration of measure phenomena. These include: 

• The martingale approach (see, e.g., [6j E], [HE Chapter 7], [TTJ Q2]), and its information-theoretic 
applications (see, e.g., [13] and references therein, [14]). This methodology will be covered in Chapter 2,
which is focused on concentration inequalities for discrete-time martingales with bounded jumps, and
on some of their potential applications in information theory, coding and communications. A recent 
interesting avenue that follows from the martingale-based inequalities that are introduced in this chapter 
is their generalization to random matrices (see, e.g., [15] and [16]). 

• The entropy method and logarithmic Sobolev inequalities (see, e.g., [3, Chapter 5], [1] and references
therein), and their information-theoretic aspects. This methodology and its remarkable
information-theoretic links will be considered in Chapter 3.

• Transportation-cost inequalities that originated from information theory (see, e.g., [3, Chapter 6], [17],
and references therein). This methodology and its information-theoretic aspects will be considered in
Chapter 3, with a discussion of the relation of transportation-cost inequalities to the entropy
method and logarithmic Sobolev inequalities.

• Talagrand's inequalities for product measures (see, e.g., [1], [3, Chapter 4], [17] and [18, Chapter 6]) and
their link to information theory [19]. These inequalities proved to be very useful in combinatorial
applications such as the common/increasing subsequence, in statistical physics applications and in functional
analysis. We do not discuss Talagrand's inequalities in detail.

• Stein's method has recently been used to prove concentration inequalities, a.k.a. concentration inequalities with
exchangeable pairs (see, e.g., [20], [21], [22] and [23]). This framework is not addressed in this tutorial.
• Concentration inequalities that follow from rigorous methods in statistical physics (see, e.g., [24], [25], [26],
[27], [28], [29], [30], [31]). These methods are not addressed either in this tutorial paper.

• The so-called reverse Lyapunov inequalities were recently used to derive concentration inequalities for
multi-dimensional log-concave distributions [32] (see also a related work in [33]). The concentration
inequalities in [32] imply an extension of the Shannon-McMillan-Breiman strong ergodic theorem to the
class of discrete-time processes with log-concave marginals. This approach is not addressed here either.

We now give a synopsis of some of the main ideas underlying the martingale approach (Chapter 2)
and the entropy method (Chapter 3). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function that is characterized by bounded
differences whenever any two $n$-dimensional input vectors differ in only one coordinate. A common
method for proving concentration of such a function of $n$ independent RVs, around the expected value
$\mathbb{E}[f]$, is called McDiarmid's inequality or the "independent bounded-differences inequality" [6]. This
inequality was proved (with some possible extensions) via the martingale approach. Although the proof
of this inequality has some similarity to the proof of the Azuma-Hoeffding inequality, the former inequality
is stated under a condition which provides an improvement by a factor of 4 in the exponent. Some of its
nice applications to algorithmic discrete mathematics were exemplified in, e.g., [6, Section 3].

The Azuma-Hoeffding inequality is by now a well-known methodology that has been often used to 
prove concentration phenomena for discrete-time martingales whose jumps are bounded almost surely. 
It is due to Hoeffding [9] who proved this inequality for a sum of independent and bounded random 
variables, and Azuma [8] later extended it to bounded-difference martingales. 

The use of the Azuma-Hoeffding inequality was introduced to the computer science literature in [34]
in order to prove concentration, around the expected value, of the chromatic number for random graphs.
The chromatic number of a graph is defined to be the minimal number of colors that is required to color
all the vertices of this graph so that no two vertices which are connected by an edge have the same
color. The ensemble for which concentration was demonstrated in [34] was the ensemble of random
graphs with $n$ vertices such that any ordered pair of vertices in the graph is connected by an edge with
a fixed probability $p$ for some $p \in (0, 1)$, and the status of each edge (present or absent) is independent
of all other edges (it is called an Erdős-Rényi ensemble). It is noted that the concentration result in [34]
was established without knowing the expected value over this ensemble. The migration of this bounding
inequality into coding theory, especially for exploring some concentration phenomena that are related to
the analysis of codes defined on graphs and iterative message-passing decoding algorithms, was initiated
in [35], [36] and [37]. During the last decade, the Azuma-Hoeffding inequality has been extensively
used for proving concentration of measure in coding theory (see, e.g., [13] and references therein). In
general, all these concentration inequalities serve to justify theoretically the ensemble approach of codes 
defined on graphs. However, much stronger concentration phenomena are observed in practice. The 
Azuma-Hoeffding inequality was also recently used in [38] for the analysis of probability estimation in 
the rare-events regime where it was assumed that an observed string is drawn i.i.d. from an unknown 
distribution, but the alphabet size and the source distribution both scale with the block length (so the 
empirical distribution does not converge to the true distribution as the block length tends to infinity). It 
is noted that the Azuma-Hoeffding inequality for a bounded martingale-difference sequence was extended 
to "centering sequences" with bounded differences [39]; this extension provides sharper concentration 
results for, e.g., sequences that are related to sampling without replacement. In [40], [41] and [42], the
martingale approach was also used to derive achievable rates and random coding error exponents for
linear and non-linear additive white Gaussian noise channels (with or without memory).

However, as pointed out by Talagrand [TJ, "for all its qualities, the martingale method has a great 
drawback: it does not seem to yield results of optimal order in several key situations. In particular, it 
seems unable to obtain even a weak version of concentration of measure phenomenon in Gaussian space." 
In Chapter [3] of this tutorial, we focus on another set of techniques, fundamentally rooted in information 
theory, that provide very strong concentration inequalities. These techniques, commonly referred to as 
the entropy method, have originated in the work of Michel Ledoux [43], who found an alternative route 
to a class of concentration inequalities for product measures originally derived by Talagrand [7] using
an ingenious inductive technique. Specifically, Ledoux noticed that the well-known Chernoff bounding
trick, which is discussed in detail in Section 3.1 and which expresses the deviation probability of the form
$\mathbb{P}(|X - \bar{x}| \ge t)$ (for an arbitrary $t > 0$) in terms of the moment-generating function (MGF) $\mathbb{E}[\exp(\lambda X)]$,
can be combined with the so-called logarithmic Sobolev inequalities, which can be used to control the
MGF in terms of the relative entropy.

Perhaps the best-known log-Sobolev inequality, first explicitly referred to as such by Leonard Gross
[44], pertains to the standard Gaussian distribution in the Euclidean space $\mathbb{R}^n$, and bounds the relative
entropy $D(P \| G^n)$ between an arbitrary probability distribution $P$ on $\mathbb{R}^n$ and the standard Gaussian
measure $G^n$ by an "energy-like" quantity related to the squared norm of the gradient of the density of $P$
w.r.t. $G^n$ (here, it can be assumed without loss of generality that $P$ is absolutely continuous w.r.t. $G^n$,
for otherwise both sides of the log-Sobolev inequality are equal to $+\infty$). By a clever analytic argument
which he attributed to an unpublished note by Ira Herbst, Gross used his log-Sobolev inequality to
show that the logarithmic MGF $\Lambda(\lambda) = \ln \mathbb{E}[\exp(\lambda U)]$ of $U = f(X^n)$, where $X^n \sim G^n$ and $f : \mathbb{R}^n \to \mathbb{R}$ is
any sufficiently smooth function with $\|\nabla f\| \le 1$, can be bounded as $\Lambda(\lambda) \le \lambda^2/2$. This bound then yields
the optimal Gaussian concentration inequality $\mathbb{P}\big(|f(X^n) - \mathbb{E}[f(X^n)]| \ge t\big) \le 2\exp(-t^2/2)$ for $X^n \sim G^n$.
(It should be pointed out that the Gaussian log-Sobolev inequality has a curious history, and seems to
have been discovered independently in various equivalent forms by several people, e.g., by Stam [45] in
the context of information theory, and by Federbush [46] in the context of mathematical quantum field
theory. Through the work of Stam [45], the Gaussian log-Sobolev inequality has been linked to several
other information-theoretic notions, such as concavity of entropy power [47], [48], [49].)

In a nutshell, the entropy method takes this idea and applies it beyond the Gaussian case. In abstract
terms, log-Sobolev inequalities are functional inequalities that relate the relative entropy between an
arbitrary distribution $Q$ w.r.t. the distribution $P$ of interest to some "energy functional" of the density
$f = \mathrm{d}Q/\mathrm{d}P$. If one is interested in studying concentration properties of some function $U = f(Z)$ with
$Z \sim P$, the core of the entropy method consists in applying an appropriate log-Sobolev inequality to the
tilted distributions $P^{(\lambda f)}$ with $\mathrm{d}P^{(\lambda f)}/\mathrm{d}P \propto \exp(\lambda f)$. Provided the function $f$ is well-behaved in the sense
of having bounded "energy," one uses the "Herbst argument" to pass from the log-Sobolev inequality
to the bound $\ln \mathbb{E}[\exp(\lambda U)] \le c\lambda^2/(2C)$, where $c > 0$ depends only on the distribution $P$, while $C > 0$
is determined by the energy content of $f$. While there is no general technique for deriving log-Sobolev
inequalities, there are nevertheless some underlying principles that can be exploited for that purpose. We
discuss some of these principles in Chapter 3. More information on log-Sobolev inequalities can be found
in several excellent monographs and lecture notes [31 El EH EH E2], as well as in [53], [54], [55], [56], [57] and
references therein.

Around the same time as Ledoux first introduced the entropy method in [43], Katalin Marton
showed in a breakthrough paper [58] that to prove concentration bounds one can bypass functional
inequalities and work directly on the level of probability measures. More specifically, Marton showed
that Gaussian concentration bounds can be deduced from so-called transportation-cost inequalities. These
inequalities, discussed in detail in Section 3.4, relate information-theoretic quantities, such as the
relative entropy, to a certain class of distances between probability measures on the metric space where the
random variables of interest are defined. These so-called Wasserstein distances have been the subject
of intense research activity that touches upon probability theory, functional analysis, dynamical systems
and partial differential equations, statistical physics, and differential geometry. A great deal of
information on this field of optimal transportation can be found in two books by Cédric Villani: [59] offers a
concise and fairly elementary introduction, while a more recent monograph [60] is a lot more detailed and
encyclopedic. Multiple connections between optimal transportation, concentration of measure, and
information theory are also explored in [17], [19], [61], [62], [63], [64], [65]. (We also note that Wasserstein distances
have been used in information theory in the context of lossy source coding [66], [67].)

The first explicit invocation of concentration inequalities in an information-theoretic context appears
in the work of Ahlswede et al. [68], [69]. These authors have shown that a certain delicate probabilistic
inequality, which was referred to as the "blowing up lemma," and which we now (thanks to the
contributions by Marton [58], [70]) recognize as a Gaussian concentration bound in Hamming space, can be used to
derive strong converses for a wide variety of information-theoretic problems, including some
multiterminal scenarios. The importance of sharp concentration inequalities for characterizing fundamental limits
of coding schemes in information theory is evident from the recent flurry of activity on finite-blocklength
analysis of source and channel codes (see, e.g., [71], [72], [73], [74], [75], [76], [77], [78]). Thus, it is timely to revisit
the use of concentration-of-measure ideas in information theory from a modern perspective. We hope
that our treatment, which above all aims to distill the core information-theoretic ideas underlying the
study of concentration of measure, will be helpful to researchers in information theory and related fields.

1.1 A reader's guide 

This tutorial is mainly focused on the interplay between concentration of measure and information theory, 
followed by some of their applications in problems related to information theory, communications and 
coding. For this reason, it is primarily aimed at serving researchers and graduate students in information 
theory, communications and coding. The mathematical background that is needed for this tutorial is real 
analysis, elementary functional analysis, and a first graduate course in probability theory and stochastic 
processes. As a refresher textbook for this mathematical background, the reader is referred, e.g., to [79].

Chapter 2 on the martingale approach is structured as follows: Section 2.1 briefly presents
discrete-time (sub/super) martingales, and Section 2.2 presents some basic inequalities that are widely used for proving
concentration inequalities via the martingale approach. Section 2.3 derives some refined versions of the
Azuma-Hoeffding inequality, and it considers interconnections between these concentration inequalities.
Section 2.4 introduces Freedman's inequality with a refined version of this inequality, and these inequalities
are specialized to get concentration inequalities for sums of independent and bounded random variables.
Section 2.5 considers some connections of the concentration inequalities that are introduced in
Section 2.3 to the method of types, a central limit theorem for martingales, the law of the iterated logarithm,
the moderate deviations principle for i.i.d. real-valued random variables, and some previously reported
concentration inequalities for discrete-parameter martingales with bounded jumps. Section 2.6 forms the
second part of this chapter, applying the concentration inequalities from Section 2.3 to information theory
and some related topics. Chapter 2 is summarized briefly in Section 2.7.

Several very nice surveys on concentration inequalities via the martingale approach are already
available, including [6], [10, Chapter 11], [11, Chapter 2] and [12]. The main focus of Chapter 2 is on the
presentation of some old and new concentration inequalities that are based on the martingale approach, 
with an emphasis on some of their potential applications in information and communication-theoretic 
aspects. This makes the presentation in this chapter different from these aforementioned surveys. 

Chapter 3 on the entropy method is structured as follows: Section 3.1 introduces the main ingredients
of the entropy method and sets up the major themes that recur throughout the chapter. Section 3.2
focuses on the logarithmic Sobolev inequality for Gaussian measures, as well as on its numerous links
to information-theoretic ideas. The general scheme of logarithmic Sobolev inequalities is introduced in
Section 3.3, and then applied to a variety of continuous and discrete examples, including an alternative
derivation of McDiarmid's inequality that does not rely on martingale methods and recovers the correct
constant in the exponent. Thus, Sections 3.2 and 3.3 present an approach to deriving concentration
bounds based on functional inequalities. In Section 3.4, concentration is examined through the lens of
geometry in probability spaces equipped with a metric. This viewpoint centers around intrinsic properties
of probability measures, and has received a great deal of attention since the pioneering work of Marton
[70], [58] on transportation-cost inequalities. Although the focus in Chapter 3 is mainly on concentration
for product measures, Section 3.5 contains a brief summary of a few results on concentration for functions
of dependent random variables, and discusses the connection between these results and the
information-theoretic machinery that has been the subject of the chapter. Several applications of concentration to
problems in information theory are surveyed in Section 3.6.



Chapter 2 



Concentration Inequalities via the 
Martingale Approach and their 
Applications in Information Theory, 
Communications and Coding 



This chapter introduces some concentration inequalities for discrete-time martingales with bounded
increments, and it exemplifies some of their potential applications in information theory and related topics.
The first part of this chapter introduces some concentration inequalities for martingales that include the
Azuma-Hoeffding, Bennett, Freedman and McDiarmid inequalities. These inequalities are also
specialized for sums of independent and bounded random variables that include the inequalities by Bernstein,
Bennett, Hoeffding, and Kearns & Saul. An improvement of the martingale inequalities for some
subclasses of martingales (e.g., the conditionally symmetric martingales) is discussed in detail, and some new
refined inequalities are derived. The first part of this chapter also considers a geometric interpretation 
of some of these inequalities, providing an insight on the inter-connections between them. The second 
part of this chapter exemplifies the potential applications of the considered martingale inequalities in the 
context of information theory and related topics. The considered applications include binary hypothesis 
testing, concentration for codes defined on graphs, concentration for OFDM signals, and a use of some 
martingale inequalities for the derivation of achievable rates under ML decoding and lower bounds on 
the error exponents for random coding over some linear or non-linear communication channels. 



2.1 Discrete-time martingales 
2.1.1 Martingales 

This subsection provides a brief review of martingales to set definitions and notation. We will not need for 
this chapter any result about martingales beyond the definition and the few basic properties mentioned 
in the following. 

Definition 1. [Discrete-time martingales] Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $n \in \mathbb{N}$. A sequence
$\{X_i, \mathcal{F}_i\}_{i=0}^n$, where the $X_i$'s are random variables and the $\mathcal{F}_i$'s are $\sigma$-algebras, is a martingale if the
following conditions are satisfied:

1. $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n$ is a sequence of sub $\sigma$-algebras of $\mathcal{F}$ (the sequence $\{\mathcal{F}_i\}_{i=0}^n$ is called a filtration);
usually, $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_n = \mathcal{F}$.

2. $X_i \in L^1(\Omega, \mathcal{F}_i, \mathbb{P})$ for every $i \in \{0, \ldots, n\}$; this means that each $X_i$ is defined on the same sample space
$\Omega$, it is $\mathcal{F}_i$-measurable, and $\mathbb{E}[|X_i|] = \int_\Omega |X_i(\omega)| \, \mathbb{P}(\mathrm{d}\omega) < \infty$.

3. For all $i \in \{1, \ldots, n\}$, the equality $X_{i-1} = \mathbb{E}[X_i \mid \mathcal{F}_{i-1}]$ holds almost surely (a.s.).

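Remark 1. Since $\{\mathcal{F}_i\}_{i=0}^n$ forms a filtration, it follows from the tower principle for conditional
expectations that (a.s.)

$$X_j = \mathbb{E}[X_i \mid \mathcal{F}_j], \quad \forall \, i > j.$$

Also, for every $i \in \mathbb{N}$, $\mathbb{E}[X_i] = \mathbb{E}\big[\mathbb{E}[X_i \mid \mathcal{F}_{i-1}]\big] = \mathbb{E}[X_{i-1}]$, so the expectation of a martingale sequence is
fixed.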

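Remark 2. One can generate martingale sequences by the following procedure: Given a RV $X \in
L^1(\Omega, \mathcal{F}, \mathbb{P})$ and an arbitrary filtration of sub $\sigma$-algebras $\{\mathcal{F}_i\}_{i=0}^n$, let

$$X_i = \mathbb{E}[X \mid \mathcal{F}_i], \quad \forall \, i \in \{0, 1, \ldots, n\}.$$

Then, the sequence $X_0, X_1, \ldots, X_n$ forms a martingale (w.r.t. the above filtration) since

1. The RV $X_i = \mathbb{E}[X \mid \mathcal{F}_i]$ is $\mathcal{F}_i$-measurable, and also $\mathbb{E}[|X_i|] \le \mathbb{E}[|X|] < \infty$.

2. By construction, $\{\mathcal{F}_i\}_{i=0}^n$ is a filtration.

3. For every $i \in \{1, \ldots, n\}$,

$$\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] = \mathbb{E}\big[\mathbb{E}[X \mid \mathcal{F}_i] \mid \mathcal{F}_{i-1}\big] = \mathbb{E}[X \mid \mathcal{F}_{i-1}] \;\; (\text{since } \mathcal{F}_{i-1} \subseteq \mathcal{F}_i) \;\; = X_{i-1} \quad \text{a.s.}$$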

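Remark 3. In continuation to Remark 2, the setting where $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_n = \mathcal{F}$ gives that
$X_0, X_1, \ldots, X_n$ is a martingale sequence with

$$X_0 = \mathbb{E}[X \mid \mathcal{F}_0] = \mathbb{E}[X], \qquad X_n = \mathbb{E}[X \mid \mathcal{F}_n] = X \quad \text{a.s.}$$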

In this case, one gets a martingale sequence where the first element is the expected value of X, and 
the last element is X itself (a.s.). This has the following interpretation: at the beginning, one doesn't 
know anything about X, so it is initially estimated by its expected value. At each step, more and more 
information about the random variable X is revealed until its value is known almost surely. 
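
The construction in Remarks 2 and 3 can be checked by exact enumeration in a small case. The following is a minimal sketch, assuming (purely for illustration) that $X = f(Y_1, \ldots, Y_n)$ for i.i.d. fair bits $Y_j$ and that $\mathcal{F}_i$ is generated by the first $i$ bits; the function names and the test function `longest_run` are hypothetical and do not come from the text.

```python
from fractions import Fraction
from itertools import product


def doob_martingale(f, n, prefix):
    """Return X_i = E[f(Y_1, ..., Y_n) | Y_1, ..., Y_i] for a given prefix.

    Illustrative assumption: the Y_j are i.i.d. fair bits, so the
    conditional expectation is a plain average over the unrevealed bits.
    """
    remaining = n - len(prefix)
    total = sum(Fraction(f(prefix + tail))
                for tail in product((0, 1), repeat=remaining))
    return total / 2 ** remaining


def longest_run(bits):
    """Hypothetical test function: length of the longest run of ones."""
    best = current = 0
    for b in bits:
        current = current + 1 if b else 0
        best = max(best, current)
    return best


def check_martingale(f, n):
    """Check X_{i-1} = E[X_i | F_{i-1}] exactly for every revealed prefix."""
    for i in range(1, n + 1):
        for prefix in product((0, 1), repeat=i - 1):
            # Averaging over the next fair bit is E[X_i | F_{i-1}].
            average = (doob_martingale(f, n, prefix + (0,))
                       + doob_martingale(f, n, prefix + (1,))) / 2
            if average != doob_martingale(f, n, prefix):
                return False
    return True


n = 6
x_first = doob_martingale(longest_run, n, ())                  # X_0 = E[X]
x_last = doob_martingale(longest_run, n, (1, 1, 0, 1, 1, 1))   # X_n = X
print(x_first, x_last, check_martingale(longest_run, n))
```

Exact rational arithmetic is used so that the martingale identity can be tested with equality rather than with a floating-point tolerance.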

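Example 1. Let $\{U_k\}_{k=1}^n$ be independent random variables on a joint probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and
assume that $\mathbb{E}[U_k] = 0$ and $\mathbb{E}[|U_k|] < \infty$ for every $k$. Let us define

$$X_k = \sum_{j=1}^{k} U_j, \quad \forall \, k \in \{1, \ldots, n\}$$

with $X_0 = 0$. Define the natural filtration where $\mathcal{F}_0 = \{\emptyset, \Omega\}$, and

$$\mathcal{F}_k = \sigma(X_1, \ldots, X_k) = \sigma(U_1, \ldots, U_k), \quad \forall \, k \in \{1, \ldots, n\}.$$

Note that $\mathcal{F}_k = \sigma(X_1, \ldots, X_k)$ denotes the minimal $\sigma$-algebra that includes all the sets of the form
$\{\omega \in \Omega : X_1(\omega) \le \alpha_1, \ldots, X_k(\omega) \le \alpha_k\}$ where $\alpha_j \in \mathbb{R} \cup \{-\infty, +\infty\}$ for $j \in \{1, \ldots, k\}$. It is easy to
verify that $\{X_k, \mathcal{F}_k\}_{k=0}^n$ is a martingale sequence (indeed, $\mathbb{E}[X_k \mid \mathcal{F}_{k-1}] = X_{k-1} + \mathbb{E}[U_k] = X_{k-1}$ by
independence); this simply implies that all the concentration inequalities
that apply to discrete-time martingales (like those introduced in this chapter) can be particularized to
concentration inequalities for sums of independent random variables.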
2.1.2 Sub/super martingales

Sub and super martingales require the first two conditions in Definition 1, and the equality in the third
condition of Definition 1 is relaxed to one of the following inequalities:

• $\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] \ge X_{i-1}$ holds a.s. for sub-martingales.

• $\mathbb{E}[X_i \mid \mathcal{F}_{i-1}] \le X_{i-1}$ holds a.s. for super-martingales.

Clearly, every random process that is both a sub and super-martingale is a martingale, and vice versa.
Furthermore, $\{X_i, \mathcal{F}_i\}$ is a sub-martingale if and only if $\{-X_i, \mathcal{F}_i\}$ is a super-martingale. The following
properties are direct consequences of Jensen's inequality for conditional expectations:

• If $\{X_i, \mathcal{F}_i\}$ is a martingale, $h$ is a convex (concave) function and $\mathbb{E}[|h(X_i)|] < \infty$, then $\{h(X_i), \mathcal{F}_i\}$ is a
sub (super) martingale.

• If $\{X_i, \mathcal{F}_i\}$ is a super-martingale, $h$ is monotonic increasing and concave, and $\mathbb{E}[|h(X_i)|] < \infty$, then
$\{h(X_i), \mathcal{F}_i\}$ is a super-martingale. Similarly, if $\{X_i, \mathcal{F}_i\}$ is a sub-martingale, $h$ is monotonic increasing
and convex, and $\mathbb{E}[|h(X_i)|] < \infty$, then $\{h(X_i), \mathcal{F}_i\}$ is a sub-martingale.

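Example 2. If $\{X_i, \mathcal{F}_i\}$ is a martingale, then $\{|X_i|, \mathcal{F}_i\}$ is a sub-martingale. Furthermore, if $X_i \in
L^2(\Omega, \mathcal{F}_i, \mathbb{P})$ then also $\{X_i^2, \mathcal{F}_i\}$ is a sub-martingale. Finally, if $\{X_i, \mathcal{F}_i\}$ is a non-negative sub-martingale
and $X_i \in L^2(\Omega, \mathcal{F}_i, \mathbb{P})$ then also $\{X_i^2, \mathcal{F}_i\}$ is a sub-martingale.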
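
The Jensen-type statements above can be illustrated on the simplest martingale. The sketch below assumes a random walk with fair $\pm 1$ steps (an illustrative choice, not taken from the text); since its increments are independent of the past, conditioning on $\mathcal{F}_{k-1}$ reduces to conditioning on the current position $x$. All function names are hypothetical.

```python
from fractions import Fraction


def conditional_mean(h, x):
    """E[h(X_k) | X_{k-1} = x] for a walk with fair +/-1 steps (assumed model)."""
    return Fraction(h(x + 1) + h(x - 1), 2)


def satisfies_submartingale_step(h, states):
    """Check E[h(X_k) | X_{k-1} = x] >= h(x) for every listed state x."""
    return all(conditional_mean(h, x) >= h(x) for x in states)


states = range(-10, 11)

# Convex functions of a martingale give sub-martingales (Example 2).
abs_ok = satisfies_submartingale_step(abs, states)
square_ok = satisfies_submartingale_step(lambda x: x * x, states)

# A concave function fails the sub-martingale inequality
# (it yields a super-martingale instead).
concave_ok = satisfies_submartingale_step(lambda x: -x * x, states)

print(abs_ok, square_ok, concave_ok)
```

For the squared walk the gap is explicit: $\mathbb{E}[X_k^2 \mid X_{k-1} = x] = x^2 + 1$.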

2.2 Basic concentration inequalities via the martingale approach 

In the following section, some basic inequalities that are widely used for proving concentration inequalities 
are presented, whose derivation relies on the martingale approach. Their proofs convey the main concepts 
of the martingale approach for proving concentration. Their presentation also motivates some further 
refinements that are considered in the continuation of this chapter. 

2.2.1 The Azuma-Hoeffding inequality 

The Azuma-Hoeffding inequality is a useful concentration inequality for bounded-difference martingales.
It was proved in [9] for independent bounded random variables, followed by a discussion on sums of 
dependent random variables; this inequality was later derived in [8] for the more general setting of 
bounded-difference martingales. In the following, this inequality is introduced. 

Theorem 1. [Azuma-Hoeffding inequality] Let $\{X_k, \mathcal{F}_k\}_{k=0}^n$ be a discrete-parameter real-valued
martingale sequence. Suppose that, for every $k \in \{1, \ldots, n\}$, the condition $|X_k - X_{k-1}| \le d_k$ holds a.s.
for a real-valued sequence $\{d_k\}_{k=1}^n$ of non-negative numbers. Then, for every $\alpha > 0$,

$$\mathbb{P}\big(|X_n - X_0| \ge \alpha\big) \le 2 \exp\left(-\frac{\alpha^2}{2 \sum_{k=1}^n d_k^2}\right). \tag{2.1}$$

The proof of the Azuma-Hoeffding inequality serves also to present the basic principles on which the
martingale approach for proving concentration results is based. Therefore, we present in the following
the proof of this inequality.

(The Azuma-Hoeffding inequality is also known as Azuma's inequality. Since it is referred to numerous times in this chapter,
it will be named Azuma's inequality for the sake of brevity.)



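Proof. For an arbitrary $\alpha > 0$,

$$\mathbb{P}\big(|X_n - X_0| \ge \alpha\big) = \mathbb{P}(X_n - X_0 \ge \alpha) + \mathbb{P}(X_n - X_0 \le -\alpha). \tag{2.2}$$

Let $\xi_k = X_k - X_{k-1}$ for $k = 1, \ldots, n$ designate the jumps of the martingale sequence. Then, it follows by
assumption that $|\xi_k| \le d_k$ and $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$ a.s. for every $k \in \{1, \ldots, n\}$.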
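From Chernoff's inequality,

$$\mathbb{P}(X_n - X_0 \ge \alpha) = \mathbb{P}\left(\sum_{k=1}^n \xi_k \ge \alpha\right) \le e^{-\alpha t} \, \mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right], \quad \forall \, t \ge 0. \tag{2.3}$$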



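Furthermore,

$$\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right] = \mathbb{E}\left[\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right) \,\Big|\, \mathcal{F}_{n-1}\right]\right] = \mathbb{E}\left[\exp\left(t \sum_{k=1}^{n-1} \xi_k\right) \mathbb{E}\big[\exp(t \xi_n) \mid \mathcal{F}_{n-1}\big]\right] \tag{2.4}$$

where the last equality holds since $Y = \exp\big(t \sum_{k=1}^{n-1} \xi_k\big)$ is $\mathcal{F}_{n-1}$-measurable; this holds due to the fact that
$\xi_k = X_k - X_{k-1}$ is $\mathcal{F}_k$-measurable for every $k \in \mathbb{N}$, and $\mathcal{F}_k \subseteq \mathcal{F}_{n-1}$ for $0 \le k \le n-1$ since $\{\mathcal{F}_k\}_{k=0}^n$ is a
filtration. Hence, the RV $\sum_{k=1}^{n-1} \xi_k$ and $Y$ are both $\mathcal{F}_{n-1}$-measurable, and $\mathbb{E}[XY \mid \mathcal{F}_{n-1}] = Y \, \mathbb{E}[X \mid \mathcal{F}_{n-1}]$.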

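Due to the convexity of the exponential function, and since $|\xi_k| \le d_k$, the straight line connecting
the end points of the exponential function is above this function over the interval $[-d_k, d_k]$. Hence, for
every $k$ (note that $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$),

$$\mathbb{E}\big[e^{t \xi_k} \mid \mathcal{F}_{k-1}\big] \le \mathbb{E}\left[\frac{(d_k + \xi_k) e^{t d_k} + (d_k - \xi_k) e^{-t d_k}}{2 d_k} \,\Big|\, \mathcal{F}_{k-1}\right] = \frac{1}{2}\left(e^{t d_k} + e^{-t d_k}\right) = \cosh(t d_k). \tag{2.5}$$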



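Since, for every integer $m \ge 0$,

$$(2m)! \ge (2m)(2m-2) \cdots 2 = 2^m \, m!$$

then, due to the power series expansions of the hyperbolic cosine and exponential functions,

$$\cosh(t d_k) = \sum_{m=0}^{\infty} \frac{(t d_k)^{2m}}{(2m)!} \le \sum_{m=0}^{\infty} \frac{(t d_k)^{2m}}{2^m \, m!} = e^{\frac{t^2 d_k^2}{2}}$$

which therefore implies that

$$\mathbb{E}\big[e^{t \xi_k} \mid \mathcal{F}_{k-1}\big] \le e^{\frac{t^2 d_k^2}{2}}.$$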
Consequently, by repeatedly using the recursion in (2.4), it follows that

$$\mathbb{E}\left[\exp\left(t \sum_{k=1}^n \xi_k\right)\right] \le \prod_{k=1}^n \exp\left(\frac{t^2 d_k^2}{2}\right) = \exp\left(\frac{t^2}{2} \sum_{k=1}^n d_k^2\right)$$

which then gives (see (2.3)) that

$$\mathbb{P}(X_n - X_0 \ge \alpha) \le \exp\left(-\alpha t + \frac{t^2}{2} \sum_{k=1}^n d_k^2\right), \quad \forall \, t \ge 0.$$

An optimization over the free parameter $t \ge 0$ gives that $t = \alpha \left(\sum_{k=1}^n d_k^2\right)^{-1}$, and

$$\mathbb{P}(X_n - X_0 \ge \alpha) \le \exp\left(-\frac{\alpha^2}{2 \sum_{k=1}^n d_k^2}\right). \tag{2.6}$$

Since, by assumption, $\{X_k, \mathcal{F}_k\}$ is a martingale with bounded jumps, so is $\{-X_k, \mathcal{F}_k\}$ (with the same
bounds on its jumps). This implies that the same bound is also valid for the probability $\mathbb{P}(X_n - X_0 \le -\alpha)$,
and together with (2.2) it completes the proof of Theorem 1. □

The proof of this inequality will be revisited later in this chapter for the derivation of some refined 
versions, whose use and advantage will be also exemplified. 
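
The two analytic steps of the proof, namely the chain $\mathbb{E}[e^{t\xi}] \le \cosh(td) \le e^{t^2 d^2/2}$ for a zero-mean jump bounded by $d$, and the optimization over $t$, can be checked numerically. The sketch below is illustrative only: it assumes a two-point zero-mean law for the jump (taking the values $+d$ and $-\gamma d$), and all function names are hypothetical.

```python
import math


def mgf_two_point(t, d, gamma):
    """E[exp(t*xi)] for an assumed zero-mean two-point jump:
    xi = +d with prob. gamma/(1+gamma), xi = -gamma*d with prob. 1/(1+gamma),
    where 0 < gamma <= 1 so that |xi| <= d."""
    p = gamma / (1.0 + gamma)
    return p * math.exp(t * d) + (1.0 - p) * math.exp(-t * gamma * d)


def chernoff_exponent(t, alpha, d_list):
    """Exponent -alpha*t + (t^2/2) * sum_k d_k^2 obtained before optimizing over t."""
    return -alpha * t + 0.5 * t * t * sum(d * d for d in d_list)


def optimal_t(alpha, d_list):
    """Minimizer of the quadratic exponent: t = alpha / sum_k d_k^2."""
    return alpha / sum(d * d for d in d_list)


# Check the chain  E[e^{t xi}] <= cosh(t d) <= exp(t^2 d^2 / 2)  on a grid of t > 0.
d, gamma = 1.5, 0.3
chain_holds = all(
    mgf_two_point(t, d, gamma) <= math.cosh(t * d) <= math.exp((t * d) ** 2 / 2)
    for t in (0.1 * k for k in range(1, 40))
)

# The optimized exponent should equal -alpha^2 / (2 * sum_k d_k^2), as in (2.6).
alpha, d_list = 3.0, [1.0, 2.0, 0.5]
t_star = optimal_t(alpha, d_list)
print(chain_holds, t_star, chernoff_exponent(t_star, alpha, d_list))
```

For $\gamma = 1$ the jump is symmetric and the first inequality in the chain holds with equality, which shows that the only loss for symmetric $\pm d$ jumps comes from the bound $\cosh(x) \le e^{x^2/2}$.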

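Remark 4. In [6, Theorem 3.13], Azuma's inequality is stated as follows: Let $\{Y_k, \mathcal{F}_k\}_{k=0}^n$ be a
martingale-difference sequence with $Y_0 = 0$ (i.e., $Y_k$ is $\mathcal{F}_k$-measurable, $\mathbb{E}[|Y_k|] < \infty$ and $\mathbb{E}[Y_k \mid \mathcal{F}_{k-1}] = 0$ a.s. for
every $k \in \{1, \ldots, n\}$). Assume that, for every $k$, there exist some numbers $a_k, b_k \in \mathbb{R}$ such that a.s.
$a_k \le Y_k \le b_k$. Then, for every $r \ge 0$,

$$\mathbb{P}\left(\Big|\sum_{k=1}^n Y_k\Big| \ge r\right) \le 2 \exp\left(-\frac{2 r^2}{\sum_{k=1}^n (b_k - a_k)^2}\right). \tag{2.7}$$

As a consequence of this inequality, consider a discrete-parameter real-valued martingale sequence $\{X_k, \mathcal{F}_k\}_{k=0}^n$
where $a_k \le X_k - X_{k-1} \le b_k$ a.s. for every $k$. Let $Y_k = X_k - X_{k-1}$ for every $k \in \{1, \ldots, n\}$, so since
$\{Y_k, \mathcal{F}_k\}_{k=0}^n$ is a martingale-difference sequence and $\sum_{k=1}^n Y_k = X_n - X_0$, then

$$\mathbb{P}\big(|X_n - X_0| \ge r\big) \le 2 \exp\left(-\frac{2 r^2}{\sum_{k=1}^n (b_k - a_k)^2}\right), \quad \forall \, r \ge 0. \tag{2.8}$$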



Example 3. Let $\{Y_i\}_{i=0}^{\infty}$ be i.i.d. binary random variables which get the values $\pm d$, for some constant
$d > 0$, with equal probability. Let $X_k = \sum_{i=0}^k Y_i$ for $k \in \{0, 1, \ldots\}$, and define the natural filtration
$\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \ldots$ where

$$\mathcal{F}_k = \sigma(Y_0, \ldots, Y_k), \quad \forall\, k \in \{0, 1, \ldots\}$$

is the $\sigma$-algebra that is generated by the random variables $Y_0, \ldots, Y_k$. Note that $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale
sequence, and (a.s.) $|X_k - X_{k-1}| = |Y_k| = d$, $\forall\, k \in \mathbb{N}$. It therefore follows from Azuma's inequality
that

$$\mathbb{P}(|X_n - X_0| \ge \alpha\sqrt{n}) \le 2\exp\left(-\frac{\alpha^2}{2d^2}\right) \tag{2.9}$$

for every $\alpha \ge 0$ and $n \in \mathbb{N}$. From the central limit theorem (CLT), since the RVs $\{Y_i\}_{i=0}^{\infty}$ are i.i.d.
with zero mean and variance $d^2$, then $\frac{1}{\sqrt{n}}(X_n - X_0) = \frac{1}{\sqrt{n}}\sum_{k=1}^n Y_k$ converges in distribution to $\mathcal{N}(0, d^2)$.
Therefore, for every $\alpha \ge 0$,

$$\lim_{n\to\infty} \mathbb{P}(|X_n - X_0| \ge \alpha\sqrt{n}) = 2\,Q\left(\frac{\alpha}{d}\right) \tag{2.10}$$

where

$$Q(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_x^{\infty} \exp\left(-\frac{t^2}{2}\right) dt, \quad \forall\, x \in \mathbb{R} \tag{2.11}$$

is the probability that a zero-mean and unit-variance Gaussian RV is larger than $x$. Since the following
exponential upper and lower bounds on the Q-function hold

$$\frac{1}{\sqrt{2\pi}}\, \frac{x}{1+x^2}\, e^{-\frac{x^2}{2}} < Q(x) < \frac{1}{\sqrt{2\pi}\, x}\, e^{-\frac{x^2}{2}}, \quad \forall\, x > 0 \tag{2.12}$$

then it follows from (2.10) that the exponent on the right-hand side of (2.9) is the exact exponent in this
example.
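The comparison in Example 3 can be reproduced exactly for finite $n$, because $X_n - X_0$ is a shifted and scaled binomial variable. The sketch below is an illustration under that observation; the function names are hypothetical and the chosen values of $n$, $\alpha$ and $d$ are arbitrary. It computes the exact two-sided tail, the bound (2.9), and the limit $2Q(\alpha/d)$ of (2.10) via the complementary error function.

```python
import math

def exact_two_sided_tail(n, alpha, d):
    """P(|X_n - X_0| >= alpha*sqrt(n)) when X_n - X_0 is a sum of n steps +d / -d w.p. 1/2."""
    threshold = alpha * math.sqrt(n)
    count = 0
    for plus_steps in range(n + 1):
        # With plus_steps steps equal to +d, the sum is d * (2*plus_steps - n).
        if abs(d * (2 * plus_steps - n)) >= threshold:
            count += math.comb(n, plus_steps)
    return count / 2 ** n

def azuma_bound(alpha, d):
    """Right-hand side of (2.9)."""
    return 2.0 * math.exp(-alpha ** 2 / (2.0 * d ** 2))

def q_function(x):
    """Gaussian tail Q(x) of (2.11), expressed through erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

for n in (10, 100, 400):
    print(n, exact_two_sided_tail(n, 1.0, 1.0), azuma_bound(1.0, 1.0), 2.0 * q_function(1.0))
```

The exact tail stays below the Azuma bound for every $n$ and approaches $2Q(\alpha/d)$ as $n$ grows, in line with (2.9) and (2.10).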

Example 4. In continuation to Example 3, let $\gamma \in (0, 1]$, and let us generalize this example by considering
the case where the i.i.d. binary RVs $\{Y_i\}_{i=0}^{\infty}$ have the probability law

$$\mathbb{P}(Y_i = +d) = \frac{\gamma}{1+\gamma}, \qquad \mathbb{P}(Y_i = -\gamma d) = \frac{1}{1+\gamma}.$$

Hence, it follows that the i.i.d. RVs $\{Y_i\}$ have zero mean (as in Example 3) and variance $\sigma^2 = \gamma d^2$. Let
$\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be defined similarly to Example 3, so that it forms a martingale sequence. Based on the
CLT, $\frac{1}{\sqrt{n}}(X_n - X_0) = \frac{1}{\sqrt{n}}\sum_{k=1}^n Y_k$ converges weakly to $\mathcal{N}(0, \gamma d^2)$, so for every $\alpha \ge 0$

$$\lim_{n\to\infty} \mathbb{P}(|X_n - X_0| \ge \alpha\sqrt{n}) = 2\,Q\left(\frac{\alpha}{\sqrt{\gamma}\, d}\right). \tag{2.13}$$

From the exponential upper and lower bounds of the Q-function in (2.12), the right-hand side of (2.13)
scales exponentially like $e^{-\frac{\alpha^2}{2\gamma d^2}}$. Hence, the exponent in this example is improved by a factor $\frac{1}{\gamma}$ as
compared to Azuma's inequality (whose bound is the same as in Example 3, since $|X_k - X_{k-1}| \le d$ for every $k \in \mathbb{N}$).
This indicates a possible refinement of Azuma's inequality by introducing an additional constraint
on the second moment. This route was studied extensively in the probability literature, and it is the
focus of Section 2.3.



2.2.2 McDiarmid's inequality 

The following useful inequality is due to McDiarmid ([39, Theorem 3.1] or [80]), and its original derivation
uses the martingale approach. In the following, we relate the derivation of this
inequality to the derivation of the Azuma-Hoeffding inequality (see the preceding subsection).

Theorem 2. [McDiarmid's inequality] Let $\{X_k\}_{k=1}^n$ be independent real-valued random variables,
taking values in the set $\mathcal{X} \triangleq \prod_{k=1}^n \mathcal{X}_k$. Let $g\colon \mathcal{X} \to \mathbb{R}$ be a measurable function such that, for some
constants $\{d_k\}_{k=1}^n$,

$$|g(x) - g(x')| \le d_k, \quad \forall\, k \in \{1,\ldots,n\} \tag{2.14}$$

where $x = (x_1, \ldots, x_{k-1}, x_k, x_{k+1}, \ldots, x_n)$ and $x' = (x_1, \ldots, x_{k-1}, x'_k, x_{k+1}, \ldots, x_n)$ are two arbitrary
points in the set $\mathcal{X}$ that may only differ in their $k$-th coordinate (this is equivalent to saying that the
variation of the function $g$ w.r.t. its $k$-th coordinate is upper bounded by $d_k$). Then, for every $\alpha \ge 0$,

$$\mathbb{P}\bigl(|g(X_1,\ldots,X_n) - \mathbb{E}[g(X_1,\ldots,X_n)]| \ge \alpha\bigr) \le 2\exp\left(-\frac{2\alpha^2}{\sum_{k=1}^n d_k^2}\right). \tag{2.15}$$



Remark 5. One can use the Azuma-Hoeffding inequality to derive a concentration inequality in
the considered setting. However, the following proof provides, in this setting, an improvement by a factor
of 4 in the exponent of the bound.

Proof. For $k \in \{1,\ldots,n\}$, let $\mathcal{F}_k = \sigma(X_1,\ldots,X_k)$ be the $\sigma$-algebra that is generated by $X_1,\ldots,X_k$, with
$\mathcal{F}_0 = \{\emptyset, \Omega\}$. Define

$$\xi_k \triangleq \mathbb{E}[g(X_1,\ldots,X_n) \mid \mathcal{F}_k] - \mathbb{E}[g(X_1,\ldots,X_n) \mid \mathcal{F}_{k-1}], \quad \forall\, k \in \{1,\ldots,n\}. \tag{2.16}$$

Note that $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n$ is a filtration, and

$$\mathbb{E}[g(X_1,\ldots,X_n) \mid \mathcal{F}_0] = \mathbb{E}[g(X_1,\ldots,X_n)], \qquad \mathbb{E}[g(X_1,\ldots,X_n) \mid \mathcal{F}_n] = g(X_1,\ldots,X_n). \tag{2.17}$$

Hence, it follows from the last three equalities that

$$g(X_1,\ldots,X_n) - \mathbb{E}[g(X_1,\ldots,X_n)] = \sum_{k=1}^n \xi_k.$$

In the following, we need a lemma: 

Lemma 1. For every $k \in \{1,\ldots,n\}$, the following properties hold a.s.:

1. $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$, so $\{\xi_k, \mathcal{F}_k\}$ is a martingale-difference and $\xi_k$ is $\mathcal{F}_k$-measurable.

2. $|\xi_k| \le d_k$.

3. $\xi_k \in [A_k, A_k + d_k]$ where $A_k$ is some non-positive and $\mathcal{F}_{k-1}$-measurable random variable.

Proof. The random variable $\xi_k$ is $\mathcal{F}_k$-measurable since $\mathcal{F}_{k-1} \subseteq \mathcal{F}_k$, and $\xi_k$ is a difference of two functions
where one is $\mathcal{F}_k$-measurable and the other is $\mathcal{F}_{k-1}$-measurable. Furthermore, it is easy to verify that
$\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$. This proves the first item. The second item follows from the first and third items. To
prove the third item, note that $\xi_k = f_k(X_1,\ldots,X_k)$ holds a.s. for some measurable function $f_k\colon \mathcal{X}_1 \times \ldots \times \mathcal{X}_k \to \mathbb{R}$.
Let us define, for every $k \in \{1,\ldots,n\}$,

$$A_k \triangleq \inf_{x \in \mathcal{X}_k} f_k(X_1,\ldots,X_{k-1},x), \qquad B_k \triangleq \sup_{x \in \mathcal{X}_k} f_k(X_1,\ldots,X_{k-1},x)$$

which are $\mathcal{F}_{k-1}$-measurable, and by definition $\xi_k \in [A_k, B_k]$ holds almost surely. Furthermore, for every
point $(x_1,\ldots,x_{k-1}) \in \mathcal{X}_1 \times \ldots \times \mathcal{X}_{k-1}$, we have

\begin{align*}
& \sup_{x \in \mathcal{X}_k} f_k(x_1,\ldots,x_{k-1},x) - \inf_{x' \in \mathcal{X}_k} f_k(x_1,\ldots,x_{k-1},x') \\
&= \sup_{x,x' \in \mathcal{X}_k} \bigl\{ f_k(x_1,\ldots,x_{k-1},x) - f_k(x_1,\ldots,x_{k-1},x') \bigr\} \\
&= \sup_{x,x' \in \mathcal{X}_k} \bigl\{ \mathbb{E}[g(X_1,\ldots,X_n) \mid X_1 = x_1,\ldots,X_{k-1} = x_{k-1}, X_k = x] \\
&\qquad\qquad - \mathbb{E}[g(X_1,\ldots,X_n) \mid X_1 = x_1,\ldots,X_{k-1} = x_{k-1}, X_k = x'] \bigr\} \\
&= \sup_{x,x' \in \mathcal{X}_k} \bigl\{ \mathbb{E}[g(x_1,\ldots,x_{k-1},x,X_{k+1},\ldots,X_n)] - \mathbb{E}[g(x_1,\ldots,x_{k-1},x',X_{k+1},\ldots,X_n)] \bigr\} \tag{2.18} \\
&= \sup_{x,x' \in \mathcal{X}_k} \bigl\{ \mathbb{E}[g(x_1,\ldots,x_{k-1},x,X_{k+1},\ldots,X_n) - g(x_1,\ldots,x_{k-1},x',X_{k+1},\ldots,X_n)] \bigr\} \\
&\le d_k \tag{2.19}
\end{align*}



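where (2.18) follows from the independence of the random variables $\{X_k\}_{k=1}^n$, and (2.19) follows from the
condition in (2.14). Hence, it follows that $B_k - A_k \le d_k$ a.s., which then implies that $\xi_k \in [A_k, A_k + d_k]$.
Since $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$ then a.s. the $\mathcal{F}_{k-1}$-measurable function $A_k$ is non-positive. It is noted that the
third item of the lemma makes it different from the proof of the Azuma-Hoeffding inequality (in that
case, it implies that $\xi_k \in [-d_k, d_k]$, where the length of the interval is twice as large). □

Since $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$ and $\xi_k \in [A_k, A_k + d_k]$ with $A_k \le 0$ and $\mathcal{F}_{k-1}$-measurable, then

$$\mathrm{Var}(\xi_k \mid \mathcal{F}_{k-1}) \le -A_k(A_k + d_k) \triangleq \sigma_k^2.$$

Applying the convexity of the exponential function (similarly to the derivation of the Azuma-Hoeffding
inequality, but this time w.r.t. the interval $[A_k, A_k + d_k]$) implies that, for every $k \in \{1,\ldots,n\}$,

\begin{align*}
\mathbb{E}\bigl[e^{t\xi_k} \mid \mathcal{F}_{k-1}\bigr]
&\le \mathbb{E}\left[\frac{(\xi_k - A_k)\,e^{t(A_k+d_k)} + (A_k + d_k - \xi_k)\,e^{tA_k}}{d_k} \,\Big|\, \mathcal{F}_{k-1}\right] \\
&= \frac{(A_k + d_k)\,e^{tA_k} - A_k\,e^{t(A_k+d_k)}}{d_k}.
\end{align*}

Let $P_k \triangleq -\frac{A_k}{d_k} \in [0, 1]$, then

\begin{align*}
\mathbb{E}\bigl[e^{t\xi_k} \mid \mathcal{F}_{k-1}\bigr]
&\le P_k\, e^{t(A_k+d_k)} + (1-P_k)\, e^{tA_k} \\
&= e^{tA_k}\bigl(1 - P_k + P_k\, e^{td_k}\bigr) \\
&= e^{H_k(t)} \tag{2.20}
\end{align*}

where

$$H_k(t) \triangleq tA_k + \ln\bigl(1 - P_k + P_k\, e^{td_k}\bigr), \quad \forall\, t \in \mathbb{R}. \tag{2.21}$$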

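Since $H_k(0) = H_k'(0) = 0$ and the geometric mean is less than or equal to the arithmetic mean then, for
every $t$,

$$H_k''(t) = \frac{d_k^2\, P_k(1-P_k)\, e^{td_k}}{\bigl(1 - P_k + P_k\, e^{td_k}\bigr)^2} \le \frac{d_k^2}{4}$$

which implies by Taylor's theorem that

$$H_k(t) \le \frac{t^2 d_k^2}{8} \tag{2.22}$$

so, from (2.20),

$$\mathbb{E}\bigl[e^{t\xi_k} \mid \mathcal{F}_{k-1}\bigr] \le e^{\frac{t^2 d_k^2}{8}}.$$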



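Similarly to the proof of the Azuma-Hoeffding inequality, by repeatedly using the recursion in (2.4), the
last inequality implies that

$$\mathbb{E}\left[\exp\left(t\sum_{k=1}^n \xi_k\right)\right] \le \exp\left(\frac{t^2}{8}\sum_{k=1}^n d_k^2\right) \tag{2.23}$$

which then gives from (2.3) that, for every $t \ge 0$,

$$\mathbb{P}\bigl(g(X_1,\ldots,X_n) - \mathbb{E}[g(X_1,\ldots,X_n)] \ge \alpha\bigr) \le \exp\left(-\alpha t + \frac{t^2}{8}\sum_{k=1}^n d_k^2\right). \tag{2.24}$$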



k=l 



An optimization over the free parameter $t \ge 0$ gives that $t = 4\alpha \left(\sum_{k=1}^n d_k^2\right)^{-1}$, so

$$\mathbb{P}\bigl(g(X_1,\ldots,X_n) - \mathbb{E}[g(X_1,\ldots,X_n)] \ge \alpha\bigr) \le \exp\left(-\frac{2\alpha^2}{\sum_{k=1}^n d_k^2}\right). \tag{2.25}$$

By replacing $g$ with $-g$, it follows that this bound is also valid for the probability

$$\mathbb{P}\bigl(g(X_1,\ldots,X_n) - \mathbb{E}[g(X_1,\ldots,X_n)] \le -\alpha\bigr)$$

which therefore gives the bound in (2.15). This completes the proof of Theorem 2. □
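The key analytic step of the proof, the bound (2.22) on $H_k(t)$, lends itself to a quick numerical sanity check. The sketch below is illustrative only: it fixes $d_k = 1$, scans an assumed grid of values for $t$ and for $A_k \in [-d_k, 0]$, and uses hypothetical function names. It also evaluates the right-hand side of (2.15).

```python
import math

def H(t, A, d):
    """H_k(t) of (2.21), with P_k = -A_k / d_k."""
    P = -A / d
    return t * A + math.log(1.0 - P + P * math.exp(t * d))

def mcdiarmid_bound(alpha, d):
    """Right-hand side of (2.15) for a list d of coordinate-wise variations d_k."""
    return 2.0 * math.exp(-2.0 * alpha ** 2 / sum(dk * dk for dk in d))

# Assumed grids: t in [-5, 5], A_k in [-1, 0], and d_k = 1.
grid_t = [i / 4 for i in range(-20, 21)]
grid_A = [-1.0, -0.75, -0.5, -0.25, 0.0]

# Largest value of H_k(t) - t^2 d_k^2 / 8 over the grid; (2.22) says it is never positive.
worst_gap = max(H(t, A, 1.0) - t * t / 8.0 for t in grid_t for A in grid_A)

print(worst_gap, mcdiarmid_bound(2.0, [1.0, 1.0, 1.0, 1.0]))
```

The gap is closest to zero near $t = 0$ with $P_k = \frac{1}{2}$, which is where $H_k''$ attains its maximal value $d_k^2/4$.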

2.2.3 Hoeffding's inequality, and its improved version (the Kearns-Saul inequality) 

In the following, we derive a concentration inequality for sums of independent and bounded random
variables as a consequence of McDiarmid's inequality. This inequality is due to Hoeffding (see [9, Theorem 2]).
An improved version of Hoeffding's inequality, due to Kearns and Saul [81], is also introduced
in the following.



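Theorem 3 (Hoeffding). Let $\{U_k\}_{k=1}^n$ be a sequence of independent and bounded random variables such
that, for every $k \in \{1,\ldots,n\}$, $U_k \in [a_k, b_k]$ holds a.s. for some constants $a_k, b_k \in \mathbb{R}$. Let $\mu_n \triangleq \sum_{k=1}^n \mathbb{E}[U_k]$.
Then,

$$\mathbb{P}\left(\left|\sum_{k=1}^n U_k - \mu_n\right| \ge \alpha\sqrt{n}\right) \le 2\exp\left(-\frac{2\alpha^2 n}{\sum_{k=1}^n (b_k - a_k)^2}\right), \quad \forall\, \alpha \ge 0. \tag{2.26}$$

Proof. Let $g(u) \triangleq \sum_{k=1}^n u_k$ for every $u \in \mathbb{R}^n$. On the set $\prod_{k=1}^n [a_k, b_k]$, where $(U_1,\ldots,U_n)$ takes its values a.s.,
the variation of $g$ w.r.t. its $k$-th coordinate is equal to $b_k - a_k$. It therefore follows from McDiarmid's inequality that

$$\mathbb{P}\bigl(|g(U_1,\ldots,U_n) - \mathbb{E}[g(U_1,\ldots,U_n)]| \ge \alpha\sqrt{n}\bigr) \le 2\exp\left(-\frac{2\alpha^2 n}{\sum_{k=1}^n (b_k - a_k)^2}\right), \quad \forall\, \alpha \ge 0.$$

The proof of (2.26) is completed by noticing that $\mathbb{E}[g(U_1,\ldots,U_n)] = \sum_{k=1}^n \mathbb{E}[U_k] = \mu_n$. □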

An improved version of Hoeffding's inequality, due to Kearns and Saul [81], is introduced in the
following. It is noted that a certain gap in the original proof of the improved inequality in [81] was
recently solved in [82] by some tedious calculus. A shorter information-theoretic proof of the basic
inequality which is required for the derivation of the improved concentration result appears in the next
chapter (see Section V-C of the next chapter). So, this basic inequality is only stated in the following,
and it is used to derive the improved version of Hoeffding's inequality.

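To this end, let $\xi_k \triangleq U_k - \mathbb{E}[U_k]$ for every $k \in \{1,\ldots,n\}$, so $\sum_{k=1}^n U_k - \mu_n = \sum_{k=1}^n \xi_k$ with $\mathbb{E}[\xi_k] = 0$
and $\xi_k \in [a_k - \mathbb{E}[U_k],\, b_k - \mathbb{E}[U_k]]$. Following the argument that is used to derive inequality (2.20) gives

\begin{align*}
\mathbb{E}[\exp(t\xi_k)] &\le (1-p_k)\exp\bigl(t(a_k - \mathbb{E}[U_k])\bigr) + p_k\exp\bigl(t(b_k - \mathbb{E}[U_k])\bigr) \\
&\triangleq \exp\bigl(H_k(t)\bigr) \tag{2.27}
\end{align*}

where $p_k \in [0, 1]$ is defined by

$$p_k \triangleq \frac{\mathbb{E}[U_k] - a_k}{b_k - a_k}, \quad \forall\, k \in \{1,\ldots,n\}. \tag{2.28}$$

The derivation of McDiarmid's inequality (see (2.22)) gives that for all $t \in \mathbb{R}$

$$H_k(t) \le \frac{t^2 (b_k - a_k)^2}{8}. \tag{2.29}$$

The improvement of this bound (see [..., Theorem 4]) gives that for all $t \in \mathbb{R}$

$$H_k(t) \le \begin{cases} \dfrac{(1-2p_k)(b_k-a_k)^2\, t^2}{4\ln\left(\frac{1-p_k}{p_k}\right)} & \text{if } p_k \ne \frac{1}{2} \\[3mm] \dfrac{(b_k-a_k)^2\, t^2}{8} & \text{if } p_k = \frac{1}{2}. \end{cases} \tag{2.30}$$

Note that, since

$$\lim_{p \to \frac{1}{2}} \frac{1-2p}{\ln\left(\frac{1-p}{p}\right)} = \frac{1}{2},$$

the upper bound in (2.30) is continuous in $p_k$, and it also improves the bound on $H_k(t)$ in (2.29)
unless $p_k = \frac{1}{2}$ (where both bounds coincide). From (2.30), we have $H_k(t) \le c_k t^2$, for every
$k \in \{1,\ldots,n\}$ and $t \in \mathbb{R}$, where

$$c_k \triangleq \begin{cases} \dfrac{(1-2p_k)(b_k-a_k)^2}{4\ln\left(\frac{1-p_k}{p_k}\right)} & \text{if } p_k \ne \frac{1}{2} \\[3mm] \dfrac{(b_k-a_k)^2}{8} & \text{if } p_k = \frac{1}{2}. \end{cases} \tag{2.31}$$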

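Hence, Chernoff's inequality and the similarity of the two one-sided tail bounds give

\begin{align*}
\mathbb{P}\left(\left|\sum_{k=1}^n U_k - \mu_n\right| \ge \alpha\sqrt{n}\right)
&\le 2\exp(-\alpha\sqrt{n}\, t) \prod_{k=1}^n \mathbb{E}[\exp(t\xi_k)] \\
&\le 2\exp(-\alpha\sqrt{n}\, t) \cdot \exp\left(\sum_{k=1}^n c_k\, t^2\right), \quad \forall\, t \ge 0. \tag{2.32}
\end{align*}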



Finally, an optimization over the non-negative free parameter $t$ leads to the following improved version
of Hoeffding's inequality in [81] (with the recent follow-up in [82]).



Theorem 4 (Kearns-Saul inequality). Let $\{U_k\}_{k=1}^n$ be a sequence of independent and bounded random
variables such that, for every $k \in \{1,\ldots,n\}$, $U_k \in [a_k, b_k]$ holds a.s. for some constants $a_k, b_k \in \mathbb{R}$. Let
$\mu_n \triangleq \sum_{k=1}^n \mathbb{E}[U_k]$. Then,

$$\mathbb{P}\left(\left|\sum_{k=1}^n U_k - \mu_n\right| \ge \alpha\sqrt{n}\right) \le 2\exp\left(-\frac{\alpha^2 n}{4\sum_{k=1}^n c_k}\right), \quad \forall\, \alpha \ge 0 \tag{2.33}$$

where $\{c_k\}_{k=1}^n$ is introduced in (2.31) with the $p_k$'s that are given in (2.28). Moreover, the exponential
bound (2.33) improves Hoeffding's inequality, unless $p_k = \frac{1}{2}$ for every $k \in \{1,\ldots,n\}$.
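The gain of (2.33) over (2.26) is governed entirely by the constants $c_k$ of (2.31). The following sketch computes them for given $a_k$, $b_k$ and $\mathbb{E}[U_k]$ and compares the two bounds. It assumes $0 < p_k < 1$ so that the logarithm is finite; the function names and the tolerance used to detect $p_k = \frac{1}{2}$ are choices made for this illustration.

```python
import math

def c_hoeffding(a, b):
    """Constant (b_k - a_k)^2 / 8 that corresponds to (2.29)."""
    return (b - a) ** 2 / 8.0

def c_kearns_saul(a, b, mean):
    """c_k of (2.31), with p_k = (E[U_k] - a_k)/(b_k - a_k) as in (2.28); assumes 0 < p_k < 1."""
    p = (mean - a) / (b - a)
    if abs(p - 0.5) < 1e-12:          # the branch p_k = 1/2 of (2.31)
        return (b - a) ** 2 / 8.0
    return (1.0 - 2.0 * p) * (b - a) ** 2 / (4.0 * math.log((1.0 - p) / p))

def kearns_saul_bound(alpha, n, cs):
    """Right-hand side of (2.33) for the list cs = [c_1, ..., c_n]."""
    return 2.0 * math.exp(-alpha ** 2 * n / (4.0 * sum(cs)))

n = 50
cs_ks = [c_kearns_saul(0.0, 1.0, 0.1)] * n      # U_k in [0, 1] with mean 0.1
cs_h = [c_hoeffding(0.0, 1.0)] * n
print(kearns_saul_bound(1.0, n, cs_ks), kearns_saul_bound(1.0, n, cs_h))
```

With all $c_k = (b_k - a_k)^2/8$, the expression (2.33) reduces to Hoeffding's bound (2.26), which is why the second printed value serves as the reference.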



The reader is referred to another recent refinement of Hoeffding's inequality in [83] , followed by some 
numerical comparisons. 



2.3 Refined versions of the Azuma-Hoeffding inequality 

Example 4 in the preceding section serves to motivate a derivation of an improved concentration inequality
with an additional constraint on the conditional variance of a martingale sequence. In the following,
assume that $|X_k - X_{k-1}| \le d$ holds a.s. for every $k$ (note that $d$ does not depend on $k$, so it is a
global bound on the jumps of the martingale). A new condition is added for the derivation of the next
concentration inequality, where it is assumed that

$$\mathrm{Var}(X_k \mid \mathcal{F}_{k-1}) = \mathbb{E}\bigl[(X_k - X_{k-1})^2 \mid \mathcal{F}_{k-1}\bigr] \le \gamma d^2$$

for some constant $\gamma \in (0, 1]$.



2.3.1 A refinement of the Azuma-Hoeffding inequality for discrete-time martingales
with bounded jumps

The following theorem appears in [80] (see also [84, Corollary 2.4.7]).

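Theorem 5. Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale. Assume that, for some
constants $d, \sigma > 0$, the following two requirements are satisfied a.s.

$$|X_k - X_{k-1}| \le d, \qquad \mathrm{Var}(X_k \mid \mathcal{F}_{k-1}) = \mathbb{E}\bigl[(X_k - X_{k-1})^2 \mid \mathcal{F}_{k-1}\bigr] \le \sigma^2$$

for every $k \in \{1,\ldots,n\}$. Then, for every $\alpha \ge 0$,

$$\mathbb{P}(|X_n - X_0| \ge \alpha n) \le 2\exp\left(-n\, D\left(\frac{\delta+\gamma}{1+\gamma} \,\Big\|\, \frac{\gamma}{1+\gamma}\right)\right) \tag{2.34}$$

where

$$\gamma \triangleq \frac{\sigma^2}{d^2}, \qquad \delta \triangleq \frac{\alpha}{d} \tag{2.35}$$

and

$$D(p\|q) \triangleq p\ln\left(\frac{p}{q}\right) + (1-p)\ln\left(\frac{1-p}{1-q}\right), \quad \forall\, p, q \in [0, 1] \tag{2.36}$$

is the divergence between the two probability distributions $(p, 1-p)$ and $(q, 1-q)$. If $\delta > 1$, then the
probability on the left-hand side of (2.34) is equal to zero.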

Proof. The proof of this bound starts similarly to the proof of the Azuma-Hoeffding inequality, up to (2.4).
The new ingredient in this proof is Bennett's inequality which replaces the argument of the convexity of
the exponential function in the proof of the Azuma-Hoeffding inequality. We introduce in the following
a lemma (see, e.g., [84, Lemma 2.4.1]) that is required for the proof of Theorem 5.

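Lemma 2 (Bennett). Let $X$ be a real-valued random variable with $\bar{x} = \mathbb{E}(X)$ and $\mathbb{E}[(X - \bar{x})^2] \le \sigma^2$ for
some $\sigma > 0$. Furthermore, suppose that $X \le b$ a.s. for some $b \in \mathbb{R}$. Then, for every $\lambda \ge 0$,

$$\mathbb{E}\bigl[e^{\lambda X}\bigr] \le \frac{e^{\lambda\bar{x}} \left[(b-\bar{x})^2\, e^{-\frac{\lambda\sigma^2}{b-\bar{x}}} + \sigma^2\, e^{\lambda(b-\bar{x})}\right]}{(b-\bar{x})^2 + \sigma^2}. \tag{2.37}$$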



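Proof. The lemma is trivial if $\lambda = 0$, so it is proved in the following for $\lambda > 0$. Let $Y \triangleq \lambda(X - \bar{x})$ for
$\lambda > 0$. Then, by assumption, $Y \le \lambda(b - \bar{x}) \triangleq b_Y$ a.s. and $\mathrm{Var}(Y) \le \lambda^2\sigma^2 \triangleq \sigma_Y^2$. It is therefore required
to show that if $\mathbb{E}[Y] = 0$, $Y \le b_Y$, and $\mathrm{Var}(Y) \le \sigma_Y^2$, then

$$\mathbb{E}\bigl[e^Y\bigr] \le \left(\frac{b_Y^2}{b_Y^2 + \sigma_Y^2}\right) e^{-\frac{\sigma_Y^2}{b_Y}} + \left(\frac{\sigma_Y^2}{b_Y^2 + \sigma_Y^2}\right) e^{b_Y}. \tag{2.38}$$

Let $Y_0$ be a random variable that gets the two possible values $-\frac{\sigma_Y^2}{b_Y}$ and $b_Y$, where

$$\mathbb{P}\left(Y_0 = -\frac{\sigma_Y^2}{b_Y}\right) = \frac{b_Y^2}{b_Y^2 + \sigma_Y^2}, \qquad \mathbb{P}(Y_0 = b_Y) = \frac{\sigma_Y^2}{b_Y^2 + \sigma_Y^2} \tag{2.39}$$

so inequality (2.38) is equivalent to showing that

$$\mathbb{E}\bigl[e^Y\bigr] \le \mathbb{E}\bigl[e^{Y_0}\bigr]. \tag{2.40}$$

To that end, let $\phi$ be the unique parabola where the function

$$f(y) \triangleq \phi(y) - e^y, \quad \forall\, y \in \mathbb{R}$$

is zero at $y = b_Y$, and $f(y) = f'(y) = 0$ at $y = -\frac{\sigma_Y^2}{b_Y}$. Since $\phi''$ is constant then $f''(y) = 0$ at exactly one
value of $y$, call it $y_0$. Furthermore, since $f\bigl(-\frac{\sigma_Y^2}{b_Y}\bigr) = f(b_Y)$ (both are equal to zero) then $f'(y_1) = 0$ for
some $y_1 \in \bigl(-\frac{\sigma_Y^2}{b_Y}, b_Y\bigr)$. By the same argument, applied to $f'$ on $\bigl[-\frac{\sigma_Y^2}{b_Y}, y_1\bigr]$, it follows that $y_0 \in \bigl(-\frac{\sigma_Y^2}{b_Y}, y_1\bigr)$.
The function $f$ is convex on $(-\infty, y_0]$ (since, on this interval, $f''(y) = \phi''(y) - e^y \ge \phi''(y) - e^{y_0} =
\phi''(y_0) - e^{y_0} = f''(y_0) = 0$), and its minimal value on this interval is at $y = -\frac{\sigma_Y^2}{b_Y}$ (since at this point, $f'$
is zero). Furthermore, $f$ is concave on $[y_0, \infty)$ and it gets its maximal value on this interval at $y = y_1$. It
implies that $f \ge 0$ on the interval $(-\infty, b_Y]$, so $\mathbb{E}[f(Y)] \ge 0$ for any random variable $Y$ such that $Y \le b_Y$
a.s., which therefore gives that

$$\mathbb{E}\bigl[e^Y\bigr] \le \mathbb{E}[\phi(Y)]$$

with equality if $\mathbb{P}\bigl(Y \in \bigl\{-\frac{\sigma_Y^2}{b_Y}, b_Y\bigr\}\bigr) = 1$. Since $f''(y) \ge 0$ for $y < y_0$ then $\phi''(y) - e^y = f''(y) \ge 0$, so
$\phi''(0) = \phi''(y) > 0$ (recall that $\phi''$ is constant since $\phi$ is a parabola). Hence, for any random variable $Y$
of zero mean, $\mathbb{E}[\phi(Y)]$, which only depends on $\mathbb{E}[Y^2]$, is a non-decreasing function of $\mathbb{E}[Y^2]$. The random
variable $Y_0$ that takes values in $\bigl\{-\frac{\sigma_Y^2}{b_Y}, b_Y\bigr\}$ and whose distribution is given in (2.39) is of zero mean and
variance $\mathbb{E}[Y_0^2] = \sigma_Y^2$, so

$$\mathbb{E}[\phi(Y)] \le \mathbb{E}[\phi(Y_0)].$$

Note also that

$$\mathbb{E}[\phi(Y_0)] = \mathbb{E}\bigl[e^{Y_0}\bigr]$$

since $f(y) = 0$ (i.e., $\phi(y) = e^y$) if $y = -\frac{\sigma_Y^2}{b_Y}$ or $b_Y$, and $Y_0$ only takes these two values. Combining the last
two inequalities with the last equality gives inequality (2.40), which therefore completes the proof of the
lemma. □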



Applying Bennett's inequality in Lemma 2 for the conditional law of $\xi_k$ given the $\sigma$-algebra $\mathcal{F}_{k-1}$,
since $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$, $\mathrm{Var}[\xi_k \mid \mathcal{F}_{k-1}] \le \sigma^2$ and $\xi_k \le d$ a.s. for $k \in \mathbb{N}$, then a.s.

$$\mathbb{E}\bigl[\exp(t\xi_k) \mid \mathcal{F}_{k-1}\bigr] \le \frac{\sigma^2 \exp(td) + d^2 \exp\left(-\frac{t\sigma^2}{d}\right)}{d^2 + \sigma^2}. \tag{2.41}$$

d z + <7 Z 



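Hence, it follows from (2.4) and (2.41) that, for every $t \ge 0$,

$$\mathbb{E}\left[\exp\left(t\sum_{k=1}^n \xi_k\right)\right] \le \left(\frac{\sigma^2 \exp(td) + d^2 \exp\left(-\frac{t\sigma^2}{d}\right)}{d^2 + \sigma^2}\right) \mathbb{E}\left[\exp\left(t\sum_{k=1}^{n-1} \xi_k\right)\right]$$

and, by induction, it follows that for every $t \ge 0$

$$\mathbb{E}\left[\exp\left(t\sum_{k=1}^n \xi_k\right)\right] \le \left(\frac{\sigma^2 \exp(td) + d^2 \exp\left(-\frac{t\sigma^2}{d}\right)}{d^2 + \sigma^2}\right)^n.$$

From the definition of $\gamma$ in (2.35), this inequality is rewritten as

$$\mathbb{E}\left[\exp\left(t\sum_{k=1}^n \xi_k\right)\right] \le \left(\frac{\gamma \exp(td) + \exp(-\gamma td)}{1+\gamma}\right)^n, \quad \forall\, t \ge 0. \tag{2.42}$$

Let $x \triangleq td$ (so $x \ge 0$). Combining Chernoff's inequality with (2.42) gives that, for every $\alpha \ge 0$ (where,
from the definition of $\delta$ in (2.35), $\alpha t = \delta x$),

\begin{align*}
\mathbb{P}(X_n - X_0 \ge \alpha n)
&\le \exp(-\alpha n t)\; \mathbb{E}\left[\exp\left(t\sum_{k=1}^n \xi_k\right)\right] \\
&\le \left(\frac{\gamma \exp\bigl((1-\delta)x\bigr) + \exp\bigl(-(\gamma+\delta)x\bigr)}{1+\gamma}\right)^n, \quad \forall\, x \ge 0. \tag{2.43}
\end{align*}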



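Consider first the case where $\delta = 1$ (i.e., $\alpha = d$), then (2.43) is particularized to

$$\mathbb{P}(X_n - X_0 \ge dn) \le \left(\frac{\gamma + \exp\bigl(-(\gamma+1)x\bigr)}{1+\gamma}\right)^n, \quad \forall\, x \ge 0$$

and the tightest bound within this form is obtained in the limit where $x \to \infty$. This provides the
inequality

$$\mathbb{P}(X_n - X_0 \ge dn) \le \left(\frac{\gamma}{1+\gamma}\right)^n. \tag{2.44}$$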



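Otherwise, if $\delta \in [0, 1)$, the minimization of the base of the exponent on the right-hand side of (2.43)
w.r.t. the free non-negative parameter $x$ yields that the optimized value is

$$x = \left(\frac{1}{1+\gamma}\right) \ln\left(\frac{\gamma+\delta}{\gamma(1-\delta)}\right) \tag{2.45}$$

and its substitution into the right-hand side of (2.43) gives that, for every $\alpha \ge 0$,

\begin{align*}
\mathbb{P}(X_n - X_0 \ge \alpha n)
&\le \left[\left(\frac{\gamma+\delta}{\gamma}\right)^{-\frac{\gamma+\delta}{1+\gamma}} (1-\delta)^{-\frac{1-\delta}{1+\gamma}}\right]^n \\
&= \exp\left(-n\left[\left(\frac{\gamma+\delta}{1+\gamma}\right) \ln\left(\frac{\gamma+\delta}{\gamma}\right) + \left(\frac{1-\delta}{1+\gamma}\right) \ln(1-\delta)\right]\right) \\
&= \exp\left(-n\, D\left(\frac{\delta+\gamma}{1+\gamma} \,\Big\|\, \frac{\gamma}{1+\gamma}\right)\right) \tag{2.46}
\end{align*}

and the exponent is equal to $+\infty$ if $\delta > 1$ (i.e., if $\alpha > d$). Applying inequality (2.46) to the martingale
$\{-X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ gives the same upper bound to the other tail-probability $\mathbb{P}(X_n - X_0 \le -\alpha n)$. The
probability of the union of the two disjoint events $\{X_n - X_0 \ge \alpha n\}$ and $\{X_n - X_0 \le -\alpha n\}$, that is
equal to the sum of their probabilities, therefore satisfies the upper bound in (2.34). This completes the
proof of Theorem 5. □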
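The passage from (2.43) to (2.46) can be verified numerically by plugging the optimizer (2.45) into the base of (2.43) and comparing with the divergence in (2.34). The sketch below does this for a few illustrative pairs $(\gamma, \delta)$; the function names are hypothetical, and the convention $0 \ln 0 = 0$ is assumed in the divergence.

```python
import math

def kl_binary(p, q):
    """D(p||q) of (2.36) in nats, with the convention 0*ln(0) = 0; assumes 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def theorem5_exponent(gamma, delta):
    """Exponent in (2.34); +infinity when delta > 1."""
    if delta > 1.0:
        return math.inf
    return kl_binary((delta + gamma) / (1.0 + gamma), gamma / (1.0 + gamma))

def base_of_2_43(x, gamma, delta):
    """Base of the n-th power on the right-hand side of (2.43)."""
    return (gamma * math.exp((1.0 - delta) * x)
            + math.exp(-(gamma + delta) * x)) / (1.0 + gamma)

def x_opt(gamma, delta):
    """Optimized parameter (2.45), valid for 0 <= delta < 1."""
    return math.log((gamma + delta) / (gamma * (1.0 - delta))) / (1.0 + gamma)

for gamma, delta in [(0.25, 0.1), (0.5, 0.5), (1.0, 0.9)]:
    print(gamma, delta,
          -math.log(base_of_2_43(x_opt(gamma, delta), gamma, delta)),
          theorem5_exponent(gamma, delta))
```

The two printed exponents agree for each pair, and at $\delta = 1$ the exponent equals $\ln\bigl(1 + \frac{1}{\gamma}\bigr)$, which matches (2.44).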

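Example 5. Let $d > 0$ and $\varepsilon \in (0, \frac{1}{2}]$ be some constants. Consider a discrete-time real-valued martingale
$\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ where a.s. $X_0 = 0$, and for every $m \in \mathbb{N}$

$$\mathbb{P}(X_m - X_{m-1} = d \mid \mathcal{F}_{m-1}) = \varepsilon, \qquad \mathbb{P}\left(X_m - X_{m-1} = -\frac{\varepsilon d}{1-\varepsilon} \,\Big|\, \mathcal{F}_{m-1}\right) = 1 - \varepsilon.$$

This indeed implies that a.s. for every $m \in \mathbb{N}$

$$\mathbb{E}[X_m - X_{m-1} \mid \mathcal{F}_{m-1}] = \varepsilon d + \left(-\frac{\varepsilon d}{1-\varepsilon}\right)(1-\varepsilon) = 0$$

and since $X_{m-1}$ is $\mathcal{F}_{m-1}$-measurable then a.s.

$$\mathbb{E}[X_m \mid \mathcal{F}_{m-1}] = X_{m-1}.$$

Since $\varepsilon \in (0, \frac{1}{2}]$ then a.s.

$$|X_m - X_{m-1}| \le \max\left\{d, \frac{\varepsilon d}{1-\varepsilon}\right\} = d.$$

From Azuma's inequality, for every $x \ge 0$,

$$\mathbb{P}(X_k \ge kx) \le \exp\left(-\frac{kx^2}{2d^2}\right) \tag{2.47}$$

independently of the value of $\varepsilon$ (note that $X_0 = 0$ a.s.). The concentration inequality in Theorem 5
enables one to get a better bound: Since a.s., for every $m \in \mathbb{N}$,

$$\mathbb{E}\bigl[(X_m - X_{m-1})^2 \mid \mathcal{F}_{m-1}\bigr] = d^2\varepsilon + \left(\frac{\varepsilon d}{1-\varepsilon}\right)^2 (1-\varepsilon) = \frac{d^2\varepsilon}{1-\varepsilon}$$

then from (2.35)

$$\gamma = \frac{\varepsilon}{1-\varepsilon}, \qquad \delta = \frac{x}{d}$$

and from (2.46), for every $x \ge 0$,

$$\mathbb{P}(X_k \ge kx) \le \exp\left(-k\, D\left(\frac{x(1-\varepsilon)}{d} + \varepsilon \,\Big\|\, \varepsilon\right)\right). \tag{2.48}$$

Consider the case where $\varepsilon \to 0$. Then, for arbitrary $x > 0$ and $k \in \mathbb{N}$, Azuma's inequality in (2.47)
provides an upper bound that is strictly positive independently of $\varepsilon$, whereas the one-sided concentration
inequality of Theorem 5 implies a bound in (2.48) that tends to zero. This exemplifies the improvement
that is obtained by Theorem 5 in comparison to Azuma's inequality.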
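The contrast between (2.47) and (2.48) as $\varepsilon \to 0$ is easy to tabulate. The sketch below evaluates both bounds for illustrative values $k = 10$, $x = 0.5$, $d = 1$ (assumptions made here, with $0 \le x \le d$ so that the divergence is finite); the function names are hypothetical.

```python
import math

def kl_binary(p, q):
    """D(p||q) of (2.36) in nats, with the convention 0*ln(0) = 0; assumes 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def azuma_example5(k, x, d):
    """Right-hand side of (2.47); it does not depend on epsilon."""
    return math.exp(-k * x * x / (2.0 * d * d))

def theorem5_example5(k, x, d, eps):
    """Right-hand side of (2.48); requires 0 <= x <= d and 0 < eps <= 1/2."""
    return math.exp(-k * kl_binary(x * (1.0 - eps) / d + eps, eps))

for eps in (0.5, 0.25, 0.05, 0.001, 1e-6):
    print(eps, azuma_example5(10, 0.5, 1.0), theorem5_example5(10, 0.5, 1.0, eps))
```

The first column of bounds stays fixed while the second decays to zero with $\varepsilon$, which is the behavior described at the end of Example 5.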

Remark 6. As was noted, e.g., in [6, Section 2], all the concentration inequalities for martingales whose
derivation is based on Chernoff's bound can be strengthened to refer to maxima. The reason is that
$\{X_k - X_0, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale, and $h(x) = \exp(tx)$ is a convex function on $\mathbb{R}$ for every $t \ge 0$. Recall
that a composition of a convex function with a martingale gives a sub-martingale w.r.t. the same filtration
(see Section 2.1.2), so it implies that $\{\exp(t(X_k - X_0)), \mathcal{F}_k\}_{k=0}^{\infty}$ is a sub-martingale for every $t \ge 0$. Hence,
by applying Doob's maximal inequality for sub-martingales, it follows that for every $\alpha \ge 0$

\begin{align*}
\mathbb{P}\left(\max_{1 \le k \le n} (X_k - X_0) \ge \alpha n\right)
&= \mathbb{P}\left(\max_{1 \le k \le n} \exp\bigl(t(X_k - X_0)\bigr) \ge \exp(\alpha n t)\right), \quad \forall\, t \ge 0 \\
&\le \exp(-\alpha n t)\; \mathbb{E}\bigl[\exp\bigl(t(X_n - X_0)\bigr)\bigr] \\
&= \exp(-\alpha n t)\; \mathbb{E}\left[\exp\left(t\sum_{k=1}^n \xi_k\right)\right]
\end{align*}

which coincides with the proof of Theorem 5 with the starting point in (2.3). This concept applies to all
the concentration inequalities derived in this chapter.

Corollary 1. Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale, and assume that
$|X_k - X_{k-1}| \le d$ holds a.s. for some constant $d > 0$ and for every $k \in \{1,\ldots,n\}$. Then, for every $\alpha \ge 0$,

$$\mathbb{P}(|X_n - X_0| \ge \alpha n) \le 2\exp\bigl(-n f(\delta)\bigr) \tag{2.49}$$

where

$$f(\delta) = \begin{cases} \ln(2)\left[1 - h_2\left(\frac{1-\delta}{2}\right)\right], & 0 \le \delta \le 1 \\ +\infty, & \delta > 1 \end{cases} \tag{2.50}$$

and $h_2(x) \triangleq -x\log_2(x) - (1-x)\log_2(1-x)$ for $0 \le x \le 1$ denotes the binary entropy function on base 2.



Proof. By substituting $\gamma = 1$ in Theorem 5 (i.e., since there is no constraint on the conditional variance,
then one can take $\sigma^2 = d^2$), the corresponding exponent in (2.34) is equal to

$$D\left(\frac{1+\delta}{2} \,\Big\|\, \frac{1}{2}\right) = f(\delta)$$

since $D\bigl(p \,\|\, \frac{1}{2}\bigr) = \ln 2\, [1 - h_2(p)]$ for every $p \in [0, 1]$. □



Remark 7. Corollary 1, which is a special case of Theorem 5 when $\gamma = 1$, forms a tightened version of the
Azuma-Hoeffding inequality when $d_k = d$. This can be verified by showing that $f(\delta) > \frac{\delta^2}{2}$ for every $\delta > 0$,
which is a direct consequence of Pinsker's inequality. Figure 2.1 compares these two exponents, which
nearly coincide for $\delta \le 0.4$. Furthermore, the improvement in the exponent of the bound in Theorem 5 is
shown in this figure as the value of $\gamma \in (0, 1)$ is reduced; this makes sense, since the additional constraint
on the conditional variance in this theorem has a growing effect when the value of $\gamma$ is decreased.
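The comparison in Remark 7 between $f(\delta)$ of (2.50) and the Azuma exponent $\frac{\delta^2}{2}$ can be reproduced on a grid. The sketch below is an illustration with hypothetical function names and an assumed grid step of $0.01$; it records the gap $f(\delta) - \frac{\delta^2}{2}$ for $\delta \in [0, 1]$.

```python
import math

def h2(x):
    """Binary entropy function on base 2, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def f(delta):
    """Exponent (2.50) of Corollary 1."""
    if delta > 1.0:
        return math.inf
    return math.log(2.0) * (1.0 - h2((1.0 - delta) / 2.0))

# Gap between the exponent of Corollary 1 and the Azuma exponent delta^2/2.
deltas = [i / 100 for i in range(0, 101)]
gaps = [f(dl) - dl * dl / 2.0 for dl in deltas]

print(max(gaps[:41]), gaps[-1])   # largest gap for delta <= 0.4, and the gap at delta = 1
```

The gap is non-negative throughout and remains small for $\delta \le 0.4$, while at $\delta = 1$ the two exponents are $\ln 2$ and $\frac{1}{2}$, respectively.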



[Figure 2.1: plot not reproduced. Horizontal axis: $\delta = \alpha/d$. Vertical axis: lower bounds on exponents. Curve labels: "Corollary 1: $f(\delta)$", "Theorem 5", and "Exponent of Azuma inequality: $\delta^2/2$".]

Figure 2.1: Plot of the lower bounds on the exponents from Azuma's inequality and the improved bounds
in Theorem 5 and Corollary 1 (where $f$ is defined in (2.50)). The pointed line refers to the exponent in
Corollary 1, and the three solid lines, each for a different value of $\gamma$, refer to the exponents in Theorem 5.



2.3.2 Geometric interpretation 

A common ingredient in proving Azuma's inequality and Theorem 5 is a derivation of an upper bound
on the conditional expectation $\mathbb{E}\bigl[e^{t\xi_k} \mid \mathcal{F}_{k-1}\bigr]$ for $t \ge 0$ where $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$, $\mathrm{Var}[\xi_k \mid \mathcal{F}_{k-1}] \le \sigma^2$, and
$|\xi_k| \le d$ a.s. for some $\sigma, d > 0$ and for every $k \in \mathbb{N}$. The derivation of Azuma's inequality and Corollary 1
is based on the line segment that connects the curve of the exponent $y(x) = e^{tx}$ at the endpoints of the
interval $[-d, d]$; due to the convexity of $y$, this chord is above the curve of the exponential function $y$ over
the interval $[-d, d]$. The derivation of Theorem 5 is based on Bennett's inequality which is applied to
the conditional expectation above. The proof of Bennett's inequality (see Lemma 2) is shortly reviewed,
while adopting the notation for the continuation of this discussion. Let $X$ be a random variable with zero
mean and variance $\mathbb{E}[X^2] = \sigma^2$, and assume that $X \le d$ a.s. for some $d > 0$. Let $\gamma \triangleq \frac{\sigma^2}{d^2}$. The geometric
viewpoint of Bennett's inequality is based on the derivation of an upper bound on the exponential function
$y$ over the interval $(-\infty, d]$; this upper bound on $y$ is a parabola that intersects $y$ at the right endpoint
$(d, e^{td})$ and is tangent to the curve of $y$ at the point $(-\gamma d, e^{-t\gamma d})$. As is verified in the proof of Lemma 2,
it leads to the inequality $y(x) \le \phi(x)$ for every $x \in (-\infty, d]$ where $\phi$ is the parabola that satisfies the
conditions

$$\phi(d) = y(d) = e^{td}, \qquad \phi(-\gamma d) = y(-\gamma d) = e^{-t\gamma d}, \qquad \phi'(-\gamma d) = y'(-\gamma d) = t\, e^{-t\gamma d}.$$

Calculation shows that this parabola admits the form

$$\phi(x) = \frac{(x + \gamma d)\, e^{td} + (d - x)\, e^{-t\gamma d}}{(1+\gamma)\, d} + \frac{\alpha\, [\gamma d^2 + (1-\gamma)\, d\, x - x^2]}{(1+\gamma)^2 d^2}$$

where $\alpha \triangleq [(1+\gamma)\, td + 1]\, e^{-t\gamma d} - e^{td}$. Since $\mathbb{E}[X] = 0$, $\mathbb{E}[X^2] = \gamma d^2$ and $X \le d$ (a.s.), then

\begin{align*}
\mathbb{E}\bigl[e^{tX}\bigr] &\le \mathbb{E}[\phi(X)] \\
&= \frac{\gamma\, e^{td} + e^{-\gamma td}}{1+\gamma} \\
&= \frac{\mathbb{E}[X^2]\, e^{td} + d^2\, e^{-\frac{t\, \mathbb{E}[X^2]}{d}}}{d^2 + \mathbb{E}[X^2]}
\end{align*}

which provides a geometric viewpoint to Bennett's inequality. Note that under the above assumption, the
bound is achieved with equality when $X$ is a RV that gets the two values $+d$ and $-\gamma d$ with probabilities
$\frac{\gamma}{1+\gamma}$ and $\frac{1}{1+\gamma}$, respectively. This bound also holds when $\mathbb{E}[X^2] \le \sigma^2$ since the right-hand side of the
inequality is a monotonic non-decreasing function of $\mathbb{E}[X^2]$ (as was verified in the proof of Lemma 2).
Applying Bennett's inequality to the conditional law of $\xi_k$ given $\mathcal{F}_{k-1}$ gives (2.41) (with $\gamma$ in (2.35)).
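The geometric picture can be checked numerically: the parabola should lie above $e^{tx}$ on $(-\infty, d]$, touch it at $x = -\gamma d$, and meet it at $x = d$. The sketch below evaluates the closed form of the parabola for illustrative values $t = 0.8$, $d = 1$, $\gamma = 0.5$ (assumptions made here) on a finite grid of $[d - 8, d]$; the function name `phi` and the variable `coeff` (the coefficient called $\alpha$ in this subsection) are hypothetical.

```python
import math

def phi(x, t, d, gamma):
    """Parabola of Section 2.3.2: tangent to e^{t x} at x = -gamma*d, equal to it at x = d."""
    coeff = ((1.0 + gamma) * t * d + 1.0) * math.exp(-t * gamma * d) - math.exp(t * d)
    chord = ((x + gamma * d) * math.exp(t * d)
             + (d - x) * math.exp(-t * gamma * d)) / ((1.0 + gamma) * d)
    bump = coeff * (gamma * d * d + (1.0 - gamma) * d * x - x * x) / ((1.0 + gamma) ** 2 * d * d)
    return chord + bump

t, d, gamma = 0.8, 1.0, 0.5                       # illustrative (assumed) parameters
xs = [d - i / 50 for i in range(0, 401)]          # grid on [d - 8, d]
min_gap = min(phi(x, t, d, gamma) - math.exp(t * x) for x in xs)

# E[phi(X)] for the two-point law P(X = d) = gamma/(1+gamma), P(X = -gamma*d) = 1/(1+gamma).
two_point_mean = (gamma * phi(d, t, d, gamma) + phi(-gamma * d, t, d, gamma)) / (1.0 + gamma)
bennett_rhs = (gamma * math.exp(t * d) + math.exp(-gamma * t * d)) / (1.0 + gamma)

print(min_gap, two_point_mean, bennett_rhs)
```

The minimal gap is zero up to rounding (attained at the two contact points), and the two-point law reproduces the right-hand side of Bennett's bound, consistent with the equality case noted above.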

2.3.3 Improving the refined version of the Azuma-Hoeffding inequality for subclasses 
of discrete-time martingales 

The following subsection derives an exponential deviation inequality that improves the bound in
Theorem 5 for conditionally symmetric discrete-time martingales with bounded increments. This subsection
further assumes conditional symmetry of these martingales, as defined in the following:

Definition 2. Let $\{X_k, \mathcal{F}_k\}_{k \in \mathbb{N}_0}$, where $\mathbb{N}_0 \triangleq \mathbb{N} \cup \{0\}$, be a discrete-time and real-valued martingale, and
let $\xi_k \triangleq X_k - X_{k-1}$ for every $k \in \mathbb{N}$ designate the jumps of the martingale. Then $\{X_k, \mathcal{F}_k\}_{k \in \mathbb{N}_0}$ is called
a conditionally symmetric martingale if, conditioned on $\mathcal{F}_{k-1}$, the random variable $\xi_k$ is symmetrically
distributed around zero.

Our goal in this subsection is to demonstrate how the assumption of conditional symmetry improves
the existing deviation inequality in Section 2.3.1 for discrete-time real-valued martingales with
bounded increments. The exponent of the new bound is also compared to the exponent of the bound in
Theorem 5 without the conditional symmetry assumption. Earlier results, serving as motivation to the
discussion in this subsection, appear in [85, Section 4] and [86, Section 6]. The new exponential bounds
can also be extended to conditionally symmetric sub- or supermartingales, where the construction of these
objects is exemplified later in this subsection. Additional results addressing weak-type inequalities,
maximal inequalities and ratio inequalities for conditionally symmetric martingales were derived in [57], [88]
and [89].

Before we present the new deviation inequality for conditionally symmetric martingales, this discussion
is motivated by introducing some constructions of such martingales.

Construction of Discrete-Time, Real-Valued and Conditionally Symmetric Sub/Supermartingales

Before proving the tightened inequalities for discrete-time conditionally symmetric sub/supermartingales,
it is in place to exemplify the construction of these objects.

Example 6. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $\{U_k\}_{k \in \mathbb{N}} \subseteq L^1(\Omega, \mathcal{F}, \mathbb{P})$ be a sequence of
independent random variables with zero mean. Let $\{\mathcal{F}_k\}_{k \ge 0}$ be the natural filtration of sub $\sigma$-algebras of
$\mathcal{F}$, where $\mathcal{F}_0 = \{\emptyset, \Omega\}$ and $\mathcal{F}_k = \sigma(U_1, \ldots, U_k)$ for $k \ge 1$. Furthermore, for $k \in \mathbb{N}$, let $A_k \in L^{\infty}(\Omega, \mathcal{F}_{k-1}, \mathbb{P})$
be an $\mathcal{F}_{k-1}$-measurable random variable with a finite essential supremum. Define a new sequence of
random variables in $L^1(\Omega, \mathcal{F}, \mathbb{P})$ where

$$X_n = \sum_{k=1}^n A_k U_k, \quad \forall\, n \in \mathbb{N}$$

and $X_0 = 0$. Then, $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a martingale. Let us assume that the random variables $\{U_k\}_{k \in \mathbb{N}}$ are
symmetrically distributed around zero. Note that $X_n = X_{n-1} + A_n U_n$ where $A_n$ is $\mathcal{F}_{n-1}$-measurable and
$U_n$ is independent of the $\sigma$-algebra $\mathcal{F}_{n-1}$ (due to the independence of the random variables $U_1, \ldots, U_n$). It
therefore follows that for every $n \in \mathbb{N}$, given $\mathcal{F}_{n-1}$, the random variable $X_n$ is symmetrically distributed
around its conditional expectation $X_{n-1}$. Hence, the martingale $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is conditionally symmetric.
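The construction of Example 6 can be made concrete with a small exact enumeration. The sketch below assumes that the $U_k$ are equiprobable $\pm 1$ signs and uses a hypothetical predictable gain $A_k$ that depends only on $U_1, \ldots, U_{k-1}$; both are choices made for illustration. It lists, for each past, the equally likely values of the increment $X_k - X_{k-1}$, which makes the conditional symmetry visible.

```python
from fractions import Fraction
from itertools import product

def predictable_gain(past_signs):
    """Hypothetical A_k: a bounded function of U_1..U_{k-1} only, hence F_{k-1}-measurable."""
    return Fraction(1) + Fraction(sum(1 for u in past_signs if u > 0), 2)

def path_values(signs):
    """Return [X_0, X_1, ..., X_n] with X_k = X_{k-1} + A_k * U_k for the given signs U_1..U_n."""
    xs = [Fraction(0)]
    for k in range(len(signs)):
        xs.append(xs[-1] + predictable_gain(signs[:k]) * signs[k])
    return xs

n = 4
# All 2^n equally likely sign sequences and the resulting martingale paths.
paths = {signs: path_values(signs) for signs in product((-1, 1), repeat=n)}

def conditional_increments(prefix):
    """Sorted, equally likely values of X_k - X_{k-1} given (U_1..U_{k-1}) = prefix."""
    k = len(prefix) + 1
    return sorted(xs[k] - xs[k - 1]
                  for signs, xs in paths.items() if signs[:k - 1] == prefix)

print(conditional_increments((1, -1)))
```

For every past, the list of increments is invariant under a sign flip and sums to zero, which is the conditional symmetry (and the martingale property) claimed in the example.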



Example 7. In continuation to Example 6, let $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ be a martingale, and define $Y_0 = 0$ and

$$Y_n = \sum_{k=1}^n A_k (X_k - X_{k-1}), \quad \forall\, n \in \mathbb{N}.$$

The sequence $\{Y_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a martingale. If $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a conditionally symmetric martingale
then the martingale $\{Y_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is also conditionally symmetric (since $Y_n = Y_{n-1} + A_n(X_n - X_{n-1})$,
and by assumption $A_n$ is $\mathcal{F}_{n-1}$-measurable).

Example 8. In continuation to Example 6, let $\{U_k\}_{k \in \mathbb{N}}$ be independent random variables with a symmetric
distribution around their expected value, and also assume that $\mathbb{E}(U_k) \le 0$ for every $k \in \mathbb{N}$.
Furthermore, let $A_k \in L^{\infty}(\Omega, \mathcal{F}_{k-1}, \mathbb{P})$, and assume that a.s. $A_k \ge 0$ for every $k \in \mathbb{N}$. Let $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$
be constructed as in Example 6. Note that $X_n = X_{n-1} + A_n U_n$ where $A_n$ is non-negative and
$\mathcal{F}_{n-1}$-measurable, and $U_n$ is independent of $\mathcal{F}_{n-1}$ and symmetrically distributed around its average. This
implies that $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a conditionally symmetric supermartingale.

Example 9. In continuation to Examples 7 and 8, let $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ be a conditionally symmetric
supermartingale. Define $\{Y_n\}_{n \in \mathbb{N}_0}$ as in Example 7 where $A_k$ is non-negative a.s. and $\mathcal{F}_{k-1}$-measurable
for every $k \in \mathbb{N}$. Then $\{Y_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a conditionally symmetric supermartingale.

Example 10. Consider a standard Brownian motion $(W_t)_{t \ge 0}$. Define, for some $T > 0$, the discrete-time
process

$$X_n = W_{nT}, \qquad \mathcal{F}_n = \sigma\bigl(\{W_t\}_{0 \le t \le nT}\bigr), \quad \forall\, n \in \mathbb{N}_0.$$

The increments of $(W_t)_{t \ge 0}$ over time intervals $[t_{k-1}, t_k]$ are statistically independent if these intervals do
not overlap (except at their endpoints), and they are Gaussian distributed with a zero mean and variance
$t_k - t_{k-1}$. The random variable $\xi_n \triangleq X_n - X_{n-1}$ is therefore statistically independent of $\mathcal{F}_{n-1}$, and it
is Gaussian distributed with a zero mean and variance $T$. The martingale $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is therefore
conditionally symmetric.

After motivating this discussion with some explicit constructions of discrete-time conditionally symmetric
martingales, we introduce a new deviation inequality for this sub-class of martingales, and then
show how its derivation follows from the martingale approach that was used earlier for the derivation of
Theorem 5. The new deviation inequality for the considered sub-class of discrete-time martingales with
bounded increments gets the following form:

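Theorem 6. Let $\{X_k, \mathcal{F}_k\}_{k \in \mathbb{N}_0}$ be a discrete-time real-valued and conditionally symmetric martingale.
Assume that, for some fixed numbers $d, \sigma > 0$, the following two requirements are satisfied a.s.

$$|X_k - X_{k-1}| \le d, \qquad \mathrm{Var}(X_k \mid \mathcal{F}_{k-1}) = \mathbb{E}\bigl[(X_k - X_{k-1})^2 \mid \mathcal{F}_{k-1}\bigr] \le \sigma^2 \tag{2.51}$$

for every $k \in \mathbb{N}$. Then, for every $\alpha \ge 0$ and $n \in \mathbb{N}$,

$$\mathbb{P}\left(\max_{1 \le k \le n} |X_k - X_0| \ge \alpha n\right) \le 2\exp\bigl(-n E(\gamma, \delta)\bigr) \tag{2.52}$$

where $\gamma$ and $\delta$ are introduced in (2.35), and for $\gamma \in (0, 1]$ and $\delta \in [0, 1)$

$$E(\gamma, \delta) \triangleq \delta x - \ln\bigl(1 + \gamma[\cosh(x) - 1]\bigr) \tag{2.53}$$

$$x \triangleq \ln\left(\frac{\delta(1-\gamma) + \sqrt{\delta^2(1-\gamma)^2 + \gamma^2(1-\delta^2)}}{\gamma(1-\delta)}\right). \tag{2.54}$$

If $\delta > 1$, then the probability on the left-hand side of (2.52) is zero (so $E(\gamma, \delta) \triangleq +\infty$), and $E(\gamma, 1) =
\ln\bigl(\frac{2}{\gamma}\bigr)$. Furthermore, the exponent $E(\gamma, \delta)$ is asymptotically optimal in the sense that there exists a
conditionally symmetric martingale, satisfying the conditions in (2.51) a.s., that attains this exponent in
the limit where $n \to \infty$.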

Remark 8. From the above conditions, without any loss of generality, $\sigma^2 \le d^2$ and therefore $\gamma \in (0, 1]$.
This implies that Theorem 6 characterizes the exponent $E(\gamma, \delta)$ for all values of $\gamma$ and $\delta$.
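As a numerical illustration, the exponent of Theorem 6 can be evaluated from its closed form and compared with the exponent of Theorem 5. The sketch below assumes the closed form $E(\gamma,\delta) = \delta x - \ln(1 + \gamma[\cosh(x) - 1])$ with the maximizing $x$ of (2.54), and the divergence-based exponent $D\big(\frac{\delta+\gamma}{1+\gamma} \,\|\, \frac{\gamma}{1+\gamma}\big)$ of Theorem 5; the function names are illustrative and do not come from the text.

```python
import math

def binary_kl(p, q):
    """Binary divergence D(p||q) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def exponent_thm5(gamma, delta):
    """Exponent of Theorem 5: D((delta+gamma)/(1+gamma) || gamma/(1+gamma))."""
    return binary_kl((delta + gamma) / (1 + gamma), gamma / (1 + gamma))

def exponent_thm6(gamma, delta):
    """Exponent E(gamma, delta) of Theorem 6, for gamma in (0, 1] and delta in [0, 1)."""
    root = math.sqrt(delta**2 * (1 - gamma)**2 + gamma**2 * (1 - delta**2))
    x = math.log((delta * (1 - gamma) + root) / (gamma * (1 - delta)))
    return delta * x - math.log(1 + gamma * (math.cosh(x) - 1))

# Print both exponents for a few (gamma, delta) pairs.
for gamma, delta in [(0.25, 0.3), (0.5, 0.5), (1.0, 0.7)]:
    print(gamma, delta, exponent_thm5(gamma, delta), exponent_thm6(gamma, delta))
```

For $\gamma < 1$ the second exponent should be the larger one, and for $\gamma = 1$ the two should agree up to rounding.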

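Corollary 2. Let $\{U_k\}_{k=1}^{\infty} \in L^2(\Omega, \mathcal{F}, \mathbb{P})$ be i.i.d. and bounded random variables with a symmetric
distribution around their mean value. Assume that $|U_1 - \mathbb{E}[U_1]| \le d$ a.s. for some $d > 0$, and $\mathrm{Var}(U_1) \le
\gamma d^2$ for some $\gamma \in [0, 1]$. Let $\{S_n\}$ designate the sequence of partial sums, i.e., $S_n \triangleq \sum_{k=1}^{n} U_k$ for every
$n \in \mathbb{N}$. Then, for every $\alpha \ge 0$,

$$\mathbb{P}\Big(\max_{1 \le k \le n} |S_k - k\,\mathbb{E}(U_1)| \ge \alpha n\Big) \le 2 \exp\big(-n E(\gamma, \delta)\big), \qquad \forall\, n \in \mathbb{N} \tag{2.55}$$

where $\delta \triangleq \alpha/d$, and $E(\gamma, \delta)$ is introduced in (2.53) and (2.54).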

Remark 9. Theorem 6 should be compared to Theorem 5 (see [80, Theorem 6.1] or [84, Corollary 2.4.7]),
which does not require the conditional symmetry property. The two exponents in Theorems 6 and 5 are
both discontinuous at $\delta = 1$. This is consistent with the assumption of the bounded jumps that implies
that $\mathbb{P}(|X_n - X_0| \ge n d \delta)$ is equal to zero if $\delta > 1$.

If $\delta \to 1^-$ then, from (2.53) and (2.54), for every $\gamma \in (0, 1]$,

$$\lim_{\delta \to 1^-} E(\gamma, \delta) = \lim_{x \to \infty} \big[x - \ln\big(1 + \gamma(\cosh(x) - 1)\big)\big] = \ln\Big(\frac{2}{\gamma}\Big). \tag{2.56}$$

On the other hand, the right limit at $\delta = 1$ is infinity since $E(\gamma, \delta) = +\infty$ for every $\delta > 1$. The same
discontinuity also exists for the exponent in Theorem 5, where the right limit at $\delta = 1$ is infinity, and the
left limit is equal to

$$\lim_{\delta \to 1^-} D\Big(\frac{\delta + \gamma}{1 + \gamma} \,\Big\|\, \frac{\gamma}{1 + \gamma}\Big) = \ln\Big(1 + \frac{1}{\gamma}\Big) \tag{2.57}$$

where the equality follows from (2.36). A comparison of the limits in (2.56) and (2.57) is consistent
with the improvement that is obtained in Theorem 6 as compared to Theorem 5 due to the additional
assumption of the conditional symmetry that is relevant if $\gamma \in (0, 1)$. It can be verified that the two
exponents coincide if $\gamma = 1$ (which is equivalent to removing the constraint on the conditional variance),
and their common value is equal to $f(\delta)$ as is defined in (2.50).

We prove in the following the new deviation inequality in Theorem 6. In order to prove Theorem 6
for a discrete-time, real-valued and conditionally symmetric martingale with bounded jumps, we deviate
from the proof of Theorem 5. This is done by a replacement of Bennett's inequality for the conditional
expectation in (2.41) with a tightened bound under the conditional symmetry assumption. To this end,
we need a lemma to proceed.

Lemma 3. Let $X$ be a real-valued RV with a symmetric distribution around zero, a support $[-d, d]$,
and assume that $\mathbb{E}[X^2] = \mathrm{Var}(X) \le \gamma d^2$ for some $d > 0$ and $\gamma \in [0, 1]$. Let $h$ be a real-valued convex
function, and assume that $h(d^2) \ge h(0)$. Then

$$\mathbb{E}[h(X^2)] \le (1 - \gamma)\, h(0) + \gamma\, h(d^2) \tag{2.58}$$

where equality holds for the symmetric distribution

$$\mathbb{P}(X = d) = \mathbb{P}(X = -d) = \frac{\gamma}{2}, \qquad \mathbb{P}(X = 0) = 1 - \gamma. \tag{2.59}$$

Proof. Since $h$ is convex and $\mathrm{supp}(X) = [-d, d]$, then a.s. $h(X^2) \le h(0) + \big(\frac{X}{d}\big)^2 \big(h(d^2) - h(0)\big)$. Taking
expectations on both sides gives (2.58), which holds with equality for the symmetric distribution in
(2.59). □


Corollary 3. If $X$ is a random variable that satisfies the three requirements in Lemma 3 then, for every
$\lambda \in \mathbb{R}$,

$$\mathbb{E}[\exp(\lambda X)] \le 1 + \gamma[\cosh(\lambda d) - 1] \tag{2.60}$$

and (2.60) holds with equality for the symmetric distribution in Lemma 3, independently of the value of
$\lambda$.

Proof. For every $\lambda \in \mathbb{R}$, due to the symmetric distribution of $X$, $\mathbb{E}[\exp(\lambda X)] = \mathbb{E}[\cosh(\lambda X)]$. The claim
now follows from Lemma 3 since, for every $x \in \mathbb{R}$, $\cosh(\lambda x) = h(x^2)$ where $h(x) \triangleq \sum_{n=0}^{\infty} \frac{\lambda^{2n} |x|^n}{(2n)!}$ is a
convex function ($h$ is convex since it is a linear combination, with non-negative coefficients, of convex
functions), and $h(d^2) = \cosh(\lambda d) \ge 1 = h(0)$. □
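The bound of Corollary 3 can be checked numerically on a concrete symmetric law. The sketch below uses, as an illustrative assumption, a random variable that is uniform on $[-d, d]$ (so $\mathbb{E}[X^2] = d^2/3$, i.e., $\gamma = 1/3$, and $\mathbb{E}[e^{\lambda X}] = \sinh(\lambda d)/(\lambda d)$), and it also confirms that the three-point distribution of Lemma 3 meets the bound with equality. The function names are hypothetical.

```python
import math

def cosh_bound(lam, d, gamma):
    """Right-hand side of (2.60): 1 + gamma * (cosh(lam * d) - 1)."""
    return 1 + gamma * (math.cosh(lam * d) - 1)

def mgf_uniform(lam, d):
    """E[exp(lam * X)] for X uniform on [-d, d]."""
    return 1.0 if lam == 0 else math.sinh(lam * d) / (lam * d)

def mgf_three_point(lam, d, gamma):
    """E[exp(lam * X)] for P(X = d) = P(X = -d) = gamma/2, P(X = 0) = 1 - gamma."""
    return (1 - gamma) + 0.5 * gamma * (math.exp(lam * d) + math.exp(-lam * d))

d = 2.0
for lam in (-1.5, -0.2, 0.0, 0.7, 3.0):
    # The uniform law has gamma = 1/3; its MGF should not exceed the bound.
    print(lam, mgf_uniform(lam, d), cosh_bound(lam, d, 1 / 3))
```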

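We continue with the proof of Theorem 6. Under the assumptions of this theorem, for every $k \in \mathbb{N}$,
the random variable $\xi_k \triangleq X_k - X_{k-1}$ satisfies a.s. $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$ and $\mathbb{E}[\xi_k^2 \mid \mathcal{F}_{k-1}] \le \sigma^2$. Applying
Corollary 3 to the conditional law of $\xi_k$ given $\mathcal{F}_{k-1}$, it follows that for every $k \in \mathbb{N}$ and $t \in \mathbb{R}$

$$\mathbb{E}[\exp(t \xi_k) \mid \mathcal{F}_{k-1}] \le 1 + \gamma[\cosh(td) - 1] \tag{2.61}$$

holds a.s., and therefore it follows from (2.4) and (2.61) that for every $t \in \mathbb{R}$

$$\mathbb{E}\Big[\exp\Big(t \sum_{k=1}^{n} \xi_k\Big)\Big] \le \big(1 + \gamma[\cosh(td) - 1]\big)^n. \tag{2.62}$$

By applying the maximal inequality for submartingales, for every $\alpha \ge 0$ and $n \in \mathbb{N}$

$$\begin{aligned}
\mathbb{P}\Big(\max_{1 \le k \le n} (X_k - X_0) \ge \alpha n\Big)
&= \mathbb{P}\Big(\max_{1 \le k \le n} \exp\big(t(X_k - X_0)\big) \ge \exp(\alpha n t)\Big), \qquad \forall\, t \ge 0 \\
&\le \exp(-\alpha n t)\, \mathbb{E}\big[\exp\big(t(X_n - X_0)\big)\big] \\
&= \exp(-\alpha n t)\, \mathbb{E}\Big[\exp\Big(t \sum_{k=1}^{n} \xi_k\Big)\Big].
\end{aligned} \tag{2.63}$$

Therefore, from (2.62) and (2.63), for every $t \ge 0$,

$$\mathbb{P}\Big(\max_{1 \le k \le n} (X_k - X_0) \ge \alpha n\Big) \le \exp(-\alpha n t)\, \big(1 + \gamma[\cosh(td) - 1]\big)^n. \tag{2.64}$$

From (2.35) and a replacement of $td$ with $x$, then for an arbitrary $\alpha \ge 0$ and $n \in \mathbb{N}$

$$\mathbb{P}\Big(\max_{1 \le k \le n} (X_k - X_0) \ge \alpha n\Big) \le \inf_{x \ge 0} \Big\{\exp\Big(-n\big[\delta x - \ln\big(1 + \gamma[\cosh(x) - 1]\big)\big]\Big)\Big\}. \tag{2.65}$$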



Applying (2.65) to the martingale $\{-X_k, \mathcal{F}_k\}_{k \in \mathbb{N}_0}$ gives the same bound on $\mathbb{P}(\min_{1 \le k \le n}(X_k - X_0) \le -\alpha n)$
for an arbitrary $\alpha \ge 0$. The union bound implies that

$$\mathbb{P}\Big(\max_{1 \le k \le n} |X_k - X_0| \ge \alpha n\Big) \le \mathbb{P}\Big(\max_{1 \le k \le n} (X_k - X_0) \ge \alpha n\Big) + \mathbb{P}\Big(\min_{1 \le k \le n} (X_k - X_0) \le -\alpha n\Big). \tag{2.66}$$

This doubles the bound on the right-hand side of (2.65), thus proving the exponential bound in Theorem 6.

Proof for the asymptotic optimality of the exponents in Theorems 5 and 6. In the following, we show
that under the conditions of Theorem 6, the exponent $E(\gamma, \delta)$ in (2.53) and (2.54) is asymptotically
optimal. To show this, let $d > 0$ and $\gamma \in (0, 1]$, and let $U_1, U_2, \ldots$ be i.i.d. random variables whose
probability distribution is given by

$$\mathbb{P}(U_i = d) = \mathbb{P}(U_i = -d) = \frac{\gamma}{2}, \qquad \mathbb{P}(U_i = 0) = 1 - \gamma, \qquad \forall\, i \in \mathbb{N}. \tag{2.67}$$



Consider the particular case of the conditionally symmetric martingale $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ in Example 6 (see
Section 2.3.3) where $X_n \triangleq \sum_{i=1}^{n} U_i$ for $n \in \mathbb{N}$, and $X_0 \triangleq 0$. It follows that $|X_n - X_{n-1}| \le d$ and
$\mathrm{Var}(X_n \mid \mathcal{F}_{n-1}) = \gamma d^2$ a.s. for every $n \in \mathbb{N}$. From Cramér's theorem in $\mathbb{R}$, for every $\alpha \ge \mathbb{E}[U_1] = 0$,

$$\lim_{n \to \infty} \frac{1}{n} \ln \mathbb{P}(X_n - X_0 \ge \alpha n) = \lim_{n \to \infty} \frac{1}{n} \ln \mathbb{P}\Big(\frac{1}{n} \sum_{i=1}^{n} U_i \ge \alpha\Big) = -I(\alpha) \tag{2.68}$$

where the rate function is given by

$$I(\alpha) = \sup_{t \ge 0} \big\{t\alpha - \ln \mathbb{E}[\exp(t U_1)]\big\} \tag{2.69}$$

(see, e.g., [84, Theorem 2.2.3] and [84, Lemma 2.2.5(b)] for the restriction of the supremum to the interval
$[0, \infty)$). From (2.67) and (2.69), for every $\alpha \ge 0$,

$$I(\alpha) = \sup_{t \ge 0} \big\{t\alpha - \ln\big(1 + \gamma[\cosh(td) - 1]\big)\big\}$$

but it is equivalent to the optimized exponent on the right-hand side of (2.64), giving the exponent of
the bound in Theorem 6. Hence, $I(\alpha) = E(\gamma, \delta)$ in (2.53) and (2.54). This proves that the exponent of
the bound in Theorem 6 is indeed asymptotically optimal in the sense that there exists a discrete-time,
real-valued and conditionally symmetric martingale, satisfying the conditions in (2.51) a.s., that attains
this exponent in the limit where $n \to \infty$. The proof for the asymptotic optimality of the exponent in
Theorem 5 (see the right-hand side of (2.34)) is similar to the proof for Theorem 6, except that the i.i.d.
random variables $U_1, U_2, \ldots$ are now distributed as follows:

$$\mathbb{P}(U_i = d) = \frac{\gamma}{1 + \gamma}, \qquad \mathbb{P}(U_i = -\gamma d) = \frac{1}{1 + \gamma}, \qquad \forall\, i \in \mathbb{N}$$

and, as before, the martingale $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is defined by $X_n = \sum_{i=1}^{n} U_i$ and $\mathcal{F}_n = \sigma(U_1, \ldots, U_n)$ for
every $n \in \mathbb{N}$ with $X_0 = 0$ and $\mathcal{F}_0 = \{\emptyset, \Omega\}$ (in this case, it is not a conditionally symmetric martingale
unless $\gamma = 1$).
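The asymptotic optimality for the symmetric three-point law can be illustrated by an exact finite-$n$ computation. The sketch below, under the assumption that $E(\gamma,\delta)$ has the closed form of (2.53) and (2.54), builds the exact distribution of the sum of $n$ i.i.d. three-point variables by repeated convolution (in units of $d$) and prints the empirical exponent $-\frac{1}{n} \ln \mathbb{P}(X_n \ge \alpha n)$ next to $E(\gamma, \delta)$. The helper names and the chosen values of $n$, $\gamma$, $\delta$ are illustrative.

```python
import math

def exponent_thm6(gamma, delta):
    """E(gamma, delta) from (2.53)-(2.54), for gamma in (0, 1] and delta in [0, 1)."""
    root = math.sqrt(delta**2 * (1 - gamma)**2 + gamma**2 * (1 - delta**2))
    x = math.log((delta * (1 - gamma) + root) / (gamma * (1 - delta)))
    return delta * x - math.log(1 + gamma * (math.cosh(x) - 1))

def three_point_tail(n, gamma, k_min):
    """Exact P(U_1 + ... + U_n >= k_min * d) for the three-point law with mass gamma/2 at +d and -d."""
    pmf = [1.0]  # after s steps, pmf[j] = P(sum = (j - s) * d)
    for _ in range(n):
        new = [0.0] * (len(pmf) + 2)
        for j, p in enumerate(pmf):
            new[j] += p * gamma / 2          # increment -d
            new[j + 1] += p * (1 - gamma)    # increment 0
            new[j + 2] += p * gamma / 2      # increment +d
        pmf = new
    return sum(p for j, p in enumerate(pmf) if j - n >= k_min)

gamma, delta = 0.5, 0.5
for n in (50, 100, 200):
    tail = three_point_tail(n, gamma, math.ceil(delta * n))
    print(n, -math.log(tail) / n, exponent_thm6(gamma, delta))
```

By the Chernoff bound, the empirical exponent can never fall below $E(\gamma, \delta)$, and the gap is expected to shrink as $n$ grows.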

Theorem 6 provides an improvement over the bound in Theorem 5 for conditionally symmetric martingales
with bounded jumps. The bounds in Theorems 5 and 6 depend on the conditional variance of the
martingale, but they do not take into consideration conditional moments of higher orders. The following
bound generalizes the bound in Theorem 6, but it does not admit in general a closed-form expression.

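Theorem 7. Let $\{X_k, \mathcal{F}_k\}_{k \in \mathbb{N}_0}$ be a discrete-time and real-valued conditionally symmetric martingale.
Let $m \in \mathbb{N}$ be an even number, and assume that the following conditions hold a.s. for every $k \in \mathbb{N}$

$$|X_k - X_{k-1}| \le d, \qquad \mathbb{E}\big[(X_k - X_{k-1})^l \mid \mathcal{F}_{k-1}\big] \le \mu_l, \qquad \forall\, l \in \{2, 4, \ldots, m\}$$

for some $d > 0$ and non-negative numbers $\{\mu_2, \mu_4, \ldots, \mu_m\}$. Then, for every $\alpha \ge 0$ and $n \in \mathbb{N}$,

$$\mathbb{P}\Big(\max_{1 \le k \le n} |X_k - X_0| \ge \alpha n\Big) \le 2 \left\{\min_{x \ge 0}\, e^{-\delta x} \left[1 + \sum_{l=1}^{\frac{m}{2}-1} \frac{(\gamma_{2l} - \gamma_m)\, x^{2l}}{(2l)!} + \gamma_m \big(\cosh(x) - 1\big)\right]\right\}^n \tag{2.70}$$

where

$$\delta \triangleq \frac{\alpha}{d}, \qquad \gamma_{2l} \triangleq \frac{\mu_{2l}}{d^{2l}}, \qquad \forall\, l \in \Big\{1, \ldots, \frac{m}{2}\Big\}. \tag{2.71}$$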



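Proof. The starting point of the proof of Theorem 7 relies on (2.63) and (2.4). For every $k \in \mathbb{N}$ and $t \in \mathbb{R}$,
since $\mathbb{E}[\xi_k^{2l-1} \mid \mathcal{F}_{k-1}] = 0$ for every $l \in \mathbb{N}$ (due to the conditional symmetry property of the martingale),

$$\begin{aligned}
\mathbb{E}[\exp(t \xi_k) \mid \mathcal{F}_{k-1}]
&= 1 + \sum_{l=1}^{\frac{m}{2}-1} \frac{t^{2l}\, \mathbb{E}[\xi_k^{2l} \mid \mathcal{F}_{k-1}]}{(2l)!} + \sum_{l=\frac{m}{2}}^{\infty} \frac{t^{2l}\, \mathbb{E}[\xi_k^{2l} \mid \mathcal{F}_{k-1}]}{(2l)!} \\
&= 1 + \sum_{l=1}^{\frac{m}{2}-1} \frac{(td)^{2l}\, \mathbb{E}\big[(\xi_k/d)^{2l} \mid \mathcal{F}_{k-1}\big]}{(2l)!} + \sum_{l=\frac{m}{2}}^{\infty} \frac{(td)^{2l}\, \mathbb{E}\big[(\xi_k/d)^{2l} \mid \mathcal{F}_{k-1}\big]}{(2l)!} \\
&\le 1 + \sum_{l=1}^{\frac{m}{2}-1} \frac{(td)^{2l}\, \gamma_{2l}}{(2l)!} + \sum_{l=\frac{m}{2}}^{\infty} \frac{(td)^{2l}\, \gamma_m}{(2l)!} \\
&= 1 + \sum_{l=1}^{\frac{m}{2}-1} \frac{(td)^{2l}\, (\gamma_{2l} - \gamma_m)}{(2l)!} + \gamma_m \big(\cosh(td) - 1\big)
\end{aligned} \tag{2.72}$$

where the inequality above holds since $|\xi_k / d| \le 1$ a.s., so that $0 \le \ldots \le \gamma_m \le \ldots \le \gamma_4 \le \gamma_2 \le 1$, and the
last equality in (2.72) holds since $\cosh(x) = \sum_{n=0}^{\infty} \frac{x^{2n}}{(2n)!}$ for every $x \in \mathbb{R}$. Therefore, from (2.4),

$$\mathbb{E}\Big[\exp\Big(t \sum_{k=1}^{n} \xi_k\Big)\Big] \le \left(1 + \sum_{l=1}^{\frac{m}{2}-1} \frac{(td)^{2l}\, (\gamma_{2l} - \gamma_m)}{(2l)!} + \gamma_m \big[\cosh(td) - 1\big]\right)^n \tag{2.73}$$

for an arbitrary $t \in \mathbb{R}$. The inequality then follows from (2.63). This completes the proof of Theorem 7.

□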
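Since the bound of Theorem 7 has no closed form in general, its exponent can be evaluated numerically. The following is a minimal sketch that assumes the bound takes the form $2\{\min_{x \ge 0} e^{-\delta x}[1 + \sum_{l=1}^{m/2-1} (\gamma_{2l} - \gamma_m) x^{2l}/(2l)! + \gamma_m(\cosh(x) - 1)]\}^n$ and locates the minimum by a plain grid search; the grid range, the step count and the function name are illustrative choices and not part of the theorem.

```python
import math

def thm7_exponent(delta, gammas, x_max=20.0, steps=20000):
    """Per-step exponent -ln(min over x) of the Theorem 7 bound, by grid search on [0, x_max].

    gammas maps each even order l in {2, 4, ..., m} to gamma_l = mu_l / d**l.
    """
    m = max(gammas)
    best = float("inf")
    for i in range(steps + 1):
        x = x_max * i / steps
        # Polynomial correction from the moments of order below m.
        poly = sum((gammas[2 * l] - gammas[m]) * x**(2 * l) / math.factorial(2 * l)
                   for l in range(1, m // 2))
        value = math.exp(-delta * x) * (1 + poly + gammas[m] * (math.cosh(x) - 1))
        best = min(best, value)
    return -math.log(best)

# m = 2 uses only the conditional variance; m = 4 adds a fourth-moment constraint.
print(thm7_exponent(0.5, {2: 0.5}))
print(thm7_exponent(0.5, {2: 0.5, 4: 0.3}))
```

With $m = 2$ the sum is empty and the exponent should reduce to that of Theorem 6; adding a fourth-moment constraint with $\gamma_4 \le \gamma_2$ can only increase it.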



2.3.4 Concentration inequalities for small deviations 

In the following, we consider the probability of the events $\{|X_n - X_0| \ge \alpha \sqrt{n}\}$ for an arbitrary $\alpha \ge 0$.
These events correspond to small deviations. This is in contrast to events of the form $\{|X_n - X_0| \ge \alpha n\}$,
whose probabilities were analyzed earlier in this section, referring to large deviations.

Proposition 1. Let $\{X_k, \mathcal{F}_k\}$ be a discrete-parameter real-valued martingale. Then, Theorem 5 implies
that for every $\alpha \ge 0$

$$\mathbb{P}\big(|X_n - X_0| \ge \alpha \sqrt{n}\big) \le 2 \exp\Big(-\frac{\delta^2}{2\gamma}\Big) \Big(1 + O\big(n^{-\frac{1}{2}}\big)\Big). \tag{2.74}$$

Proof. See Appendix 2.A. □

Remark 10. From Proposition 1, the upper bound on $\mathbb{P}(|X_n - X_0| \ge \alpha \sqrt{n})$ (for an arbitrary $\alpha \ge 0$)
improves the exponent of Azuma's inequality by a factor of $\frac{1}{\gamma}$.
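The small-deviation exponent can be checked numerically. The sketch below assumes that the bound of Theorem 5, $\exp\big(-n D\big(\frac{\delta+\gamma}{1+\gamma} \,\|\, \frac{\gamma}{1+\gamma}\big)\big)$, is applied with $\delta$ replaced by $\delta/\sqrt{n}$ (which is how a deviation of $\alpha\sqrt{n}$ maps onto a deviation of the form $\alpha' n$), and it prints the resulting exponent next to the limit $\delta^2/(2\gamma)$ and the exponent $\delta^2/2$ of Azuma's inequality. The function names are illustrative.

```python
import math

def binary_kl(p, q):
    """Binary divergence D(p||q) in nats, for p and q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def scaled_exponent(n, delta, gamma):
    """n * D((delta_n + gamma)/(1 + gamma) || gamma/(1 + gamma)) with delta_n = delta / sqrt(n)."""
    delta_n = delta / math.sqrt(n)
    return n * binary_kl((delta_n + gamma) / (1 + gamma), gamma / (1 + gamma))

delta, gamma = 1.0, 0.25
for n in (10**2, 10**4, 10**6):
    # Columns: n, exponent from Theorem 5, limit delta^2/(2*gamma), Azuma exponent delta^2/2.
    print(n, scaled_exponent(n, delta, gamma), delta**2 / (2 * gamma), delta**2 / 2)
```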



2.3.5 Inequalities for sub and super martingales 



Upper bounds on the probability $\mathbb{P}(X_n - X_0 \ge r)$ for $r > 0$, earlier derived in this section for martingales,
can be adapted to super-martingales (similarly to, e.g., [11, Chapter 2] or [12, Section 2.7]). Alternatively,
replacing $\{X_k, \mathcal{F}_k\}_{k=0}^{n}$ with $\{-X_k, \mathcal{F}_k\}_{k=0}^{n}$ provides upper bounds on the probability $\mathbb{P}(X_n - X_0 \le -r)$
for sub-martingales. For example, the adaptation of Theorem 5 to sub and super martingales gives the
following inequality:



Corollary 4. Let $\{X_k, \mathcal{F}_k\}_{k=0}^{n}$ be a discrete-parameter real-valued super-martingale. Assume that, for
some constants $d, \sigma > 0$, the following two requirements are satisfied a.s.

$$X_k - \mathbb{E}[X_k \mid \mathcal{F}_{k-1}] \le d,$$

$$\mathrm{Var}(X_k \mid \mathcal{F}_{k-1}) \triangleq \mathbb{E}\Big[\big(X_k - \mathbb{E}[X_k \mid \mathcal{F}_{k-1}]\big)^2 \,\Big|\, \mathcal{F}_{k-1}\Big] \le \sigma^2$$

for every $k \in \{1, \ldots, n\}$. Then, for every $\alpha \ge 0$,

$$\mathbb{P}(X_n - X_0 \ge \alpha n) \le \exp\Big(-n\, D\Big(\frac{\delta + \gamma}{1 + \gamma} \,\Big\|\, \frac{\gamma}{1 + \gamma}\Big)\Big) \tag{2.75}$$

where $\gamma$ and $\delta$ are defined as in (2.35), and the divergence $D(p\|q)$ is introduced in (2.36). Alternatively,
if $\{X_k, \mathcal{F}_k\}_{k=0}^{n}$ is a sub-martingale, the same upper bound in (2.75) holds for the probability $\mathbb{P}(X_n - X_0 \le
-\alpha n)$. If $\delta > 1$, then these two probabilities are equal to zero.

Proof. The proof of this corollary is similar to the proof of Theorem 5. The only difference is that for a
super-martingale, due to its basic property in Section 2.1.2,

$$X_n - X_0 = \sum_{k=1}^{n} (X_k - X_{k-1}) \le \sum_{k=1}^{n} \xi_k$$

a.s., where $\xi_k \triangleq X_k - \mathbb{E}[X_k \mid \mathcal{F}_{k-1}]$ is $\mathcal{F}_k$-measurable. Hence $\mathbb{P}(X_n - X_0 \ge \alpha n) \le \mathbb{P}\big(\sum_{k=1}^{n} \xi_k \ge \alpha n\big)$
where a.s. $\xi_k \le d$, $\mathbb{E}[\xi_k \mid \mathcal{F}_{k-1}] = 0$, and $\mathrm{Var}(\xi_k \mid \mathcal{F}_{k-1}) \le \sigma^2$. The continuation of the proof coincides
with the proof of Theorem 5 (starting from (2.3)). The other inequality for sub-martingales holds due to
the fact that if $\{X_k, \mathcal{F}_k\}$ is a sub-martingale then $\{-X_k, \mathcal{F}_k\}$ is a super-martingale. □



2.4 Freedman's inequality and a refined version 

We consider in the following a different type of exponential inequalities for discrete-time martingales
with bounded jumps, which is a classical inequality that dates back to Freedman [90]. Freedman's
inequality is refined in the following to conditionally symmetric martingales with bounded jumps (see
[91]). Furthermore, these two inequalities are specialized to two concentration inequalities for sums of
independent and bounded random variables.

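Theorem 8. Let $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ be a discrete-time real-valued and conditionally symmetric martingale.
Assume that there exists a fixed number $d > 0$ such that $\xi_k \triangleq X_k - X_{k-1} \le d$ a.s. for every $k \in \mathbb{N}$. Let

$$Q_n \triangleq \sum_{k=1}^{n} \mathbb{E}[\xi_k^2 \mid \mathcal{F}_{k-1}] \tag{2.76}$$

with $Q_0 \triangleq 0$, be the predictable quadratic variation of the martingale up to time $n$. Then, for every
$z, r > 0$,

$$\mathbb{P}\Big(\max_{1 \le k \le n} (X_k - X_0) \ge z,\; Q_n \le r \ \text{for some } n \in \mathbb{N}\Big) \le \exp\Big(-\frac{z^2}{2r} \cdot C\Big(\frac{zd}{r}\Big)\Big) \tag{2.77}$$

where

$$C(u) \triangleq \frac{2\big[u \sinh^{-1}(u) - \sqrt{1 + u^2} + 1\big]}{u^2}, \qquad \forall\, u > 0. \tag{2.78}$$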

Theorem 8 should be compared to Freedman's inequality in [90, Theorem 1.6] (see also [84, Exercise
2.4.21(b)]) that was stated without the requirement for the conditional symmetry of the martingale.
It provides the following result:

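Theorem 9. Let $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ be a discrete-time real-valued martingale. Assume that there exists a
fixed number $d > 0$ such that $\xi_k \triangleq X_k - X_{k-1} \le d$ a.s. for every $k \in \mathbb{N}$. Then, for every $z, r > 0$,

$$\mathbb{P}\Big(\max_{1 \le k \le n} (X_k - X_0) \ge z,\; Q_n \le r \ \text{for some } n \in \mathbb{N}\Big) \le \exp\Big(-\frac{z^2}{2r} \cdot B\Big(\frac{zd}{r}\Big)\Big) \tag{2.79}$$

where

$$B(u) \triangleq \frac{2\big[(1 + u) \ln(1 + u) - u\big]}{u^2}, \qquad \forall\, u > 0. \tag{2.80}$$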
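To compare the two bounds numerically, the following sketch evaluates the right-hand sides of Freedman's inequality and of its refinement, assuming $B(u) = 2[(1+u)\ln(1+u) - u]/u^2$ and $C(u) = 2[u \sinh^{-1}(u) - \sqrt{1+u^2} + 1]/u^2$ and the common form $\exp\big(-\frac{z^2}{2r} \cdot F(zd/r)\big)$ with $F \in \{B, C\}$. The function names and the sample values of $z$, $r$, $d$ are illustrative.

```python
import math

def B(u):
    """B(u) of Freedman's inequality, for u > 0."""
    return 2 * ((1 + u) * math.log1p(u) - u) / u**2

def C(u):
    """C(u) of the refined inequality for conditionally symmetric martingales, for u > 0."""
    return 2 * (u * math.asinh(u) - math.sqrt(1 + u * u) + 1) / u**2

def freedman_bound(z, r, d):
    """exp(-z^2/(2r) * B(z*d/r))."""
    return math.exp(-z**2 / (2 * r) * B(z * d / r))

def refined_freedman_bound(z, r, d):
    """exp(-z^2/(2r) * C(z*d/r))."""
    return math.exp(-z**2 / (2 * r) * C(z * d / r))

for z, r, d in [(1.0, 4.0, 1.0), (3.0, 2.0, 1.0), (10.0, 5.0, 2.0)]:
    print(z, r, d, freedman_bound(z, r, d), refined_freedman_bound(z, r, d))
```

Both $B(u)$ and $C(u)$ tend to 1 as $u \to 0$, so for $zd \ll r$ the two bounds are close to the Gaussian-type bound $\exp(-z^2/(2r))$; the gap between them widens as $zd/r$ grows.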



The proof of [90, Theorem 1.6] is modified in the following by using Bennett's inequality for the
derivation of the original bound in Theorem 9 (without the conditional symmetry requirement). Furthermore,
this modified proof serves to derive the improved bound in Theorem 8 under the conditional
symmetry assumption of the martingale sequence.

We provide in the following a combined proof of Theorems 8 and 9.

Proof. The proof of Theorem 8 relies on the proof of Freedman's inequality in Theorem 9, where the
latter dates back to Freedman's paper (see [90, Theorem 1.6], and also [84, Exercise 2.4.21(b)]). The
original proof of Theorem 9 (see [90, Section 3]) is modified in a way that makes it easier to see how the
bound can be improved for conditionally symmetric martingales with bounded jumps. This improvement
is obtained via the refinement in (2.61) of Bennett's inequality for conditionally symmetric distributions.
Furthermore, the following revisited proof of Theorem 9 simplifies the derivation of the new and improved
bound in Theorem 8 for the considered subclass of martingales.

Without any loss of generality, let us assume that $d = 1$ (otherwise, $\{X_k\}$ and $z$ are divided by $d$, and
$\{Q_k\}$ and $r$ are divided by $d^2$; this normalization extends the bound to the case of an arbitrary $d > 0$).
Let $S_n \triangleq X_n - X_0$ for every $n \in \mathbb{N}_0$; then $\{S_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a martingale with $S_0 = 0$. The proof starts by
introducing two lemmas.

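Lemma 4. Under the assumptions of Theorem 9, let

$$U_n \triangleq \exp(\lambda S_n - \theta Q_n), \qquad \forall\, n \in \{0, 1, \ldots\} \tag{2.81}$$

where $\lambda \ge 0$ and $\theta \ge e^{\lambda} - \lambda - 1$ are arbitrary constants. Then, $\{U_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a supermartingale.

Proof. $U_n$ in (2.81) is $\mathcal{F}_n$-measurable (since $Q_n$ in (2.76) is $\mathcal{F}_{n-1}$-measurable, where $\mathcal{F}_{n-1} \subseteq \mathcal{F}_n$, and $S_n$
is $\mathcal{F}_n$-measurable), $Q_n$ and $U_n$ are non-negative random variables, and $S_n = \sum_{k=1}^{n} \xi_k \le n$ a.s. (since
$\xi_k \le 1$ and $S_0 = 0$). It therefore follows that $0 \le U_n \le e^{\lambda n}$ a.s. for $\lambda, \theta \ge 0$, so $U_n \in L^1(\Omega, \mathcal{F}_n, \mathbb{P})$. It is
required to show that $\mathbb{E}[U_n \mid \mathcal{F}_{n-1}] \le U_{n-1}$ holds a.s. for every $n \in \mathbb{N}$, under the above assumptions on
the parameters $\lambda$ and $\theta$ in (2.81).

$$\begin{aligned}
\mathbb{E}[U_n \mid \mathcal{F}_{n-1}]
&\overset{(a)}{=} \exp(-\theta Q_n)\, \exp(\lambda S_{n-1})\, \mathbb{E}[\exp(\lambda \xi_n) \mid \mathcal{F}_{n-1}] \\
&\overset{(b)}{=} \exp(\lambda S_{n-1})\, \exp\big(-\theta(Q_{n-1} + \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}])\big)\, \mathbb{E}[\exp(\lambda \xi_n) \mid \mathcal{F}_{n-1}] \\
&\overset{(c)}{=} U_{n-1} \left(\frac{\mathbb{E}[\exp(\lambda \xi_n) \mid \mathcal{F}_{n-1}]}{\exp\big(\theta\, \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]\big)}\right)
\end{aligned} \tag{2.82}$$

where (a) follows from (2.81) and because $Q_n$ and $S_{n-1}$ are $\mathcal{F}_{n-1}$-measurable and $S_n = S_{n-1} + \xi_n$, (b)
follows from (2.76), and (c) follows from (2.81).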

A modification of the original proof of Lemma 4 (see [90, Section 3]) is suggested in the following, which
then makes it possible to improve the bound in Theorem 9 for real-valued, discrete-time, conditionally symmetric
martingales with bounded jumps. This leads to the improved bound in Theorem 8 for the considered
subclass of martingales.

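Since by assumption $\xi_n \le 1$ and $\mathbb{E}[\xi_n \mid \mathcal{F}_{n-1}] = 0$ a.s., then applying Bennett's inequality in (2.41) to
the conditional expectation of $e^{\lambda \xi_n}$ given $\mathcal{F}_{n-1}$ (recall that $\lambda \ge 0$) gives

$$\mathbb{E}[\exp(\lambda \xi_n) \mid \mathcal{F}_{n-1}] \le \frac{\exp\big(-\lambda\, \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]\big) + \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]\, \exp(\lambda)}{1 + \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]}$$

which therefore implies from (2.82) and the last inequality that

$$\mathbb{E}[U_n \mid \mathcal{F}_{n-1}] \le U_{n-1} \left(\frac{\exp\big(-(\lambda + \theta)\, \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]\big)}{1 + \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]} + \frac{\mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]\, \exp\big(\lambda - \theta\, \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]\big)}{1 + \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]}\right). \tag{2.83}$$

In order to prove that $\mathbb{E}[U_n \mid \mathcal{F}_{n-1}] \le U_{n-1}$ a.s., it is sufficient to prove that the second term on the
right-hand side of (2.83) is a.s. less than or equal to 1. To this end, let us find the condition on $\lambda, \theta \ge 0$
such that for every $\alpha \ge 0$

$$\Big(\frac{1}{1 + \alpha}\Big) \exp\big(-\alpha(\lambda + \theta)\big) + \Big(\frac{\alpha}{1 + \alpha}\Big) \exp(\lambda - \alpha\theta) \le 1 \tag{2.84}$$

which then assures that the second term on the right-hand side of (2.83) is less than or equal to 1 a.s. as
required.

Lemma 5. If $\lambda \ge 0$ and $\theta \ge \exp(\lambda) - \lambda - 1$ then the condition in (2.84) is satisfied for every $\alpha \ge 0$.

Proof. This claim follows by calculus, showing that the function

$$g(\alpha) = (1 + \alpha) \exp(\alpha\theta) - \alpha \exp(\lambda) - \exp(-\alpha\lambda), \qquad \forall\, \alpha \ge 0$$

is non-negative on $\mathbb{R}_+$ if $\lambda \ge 0$ and $\theta \ge \exp(\lambda) - \lambda - 1$. □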
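The calculus claim of Lemma 5 can be spot-checked on a grid. The sketch below assumes $g(\alpha) = (1+\alpha) e^{\alpha\theta} - \alpha e^{\lambda} - e^{-\alpha\lambda}$, which is non-negative exactly when the condition of the lemma holds at $\alpha$ (multiply that condition by $(1+\alpha) e^{\alpha\theta} > 0$). It evaluates the grid minimum of $g$ at the threshold $\theta = e^{\lambda} - \lambda - 1$ and at a slightly smaller $\theta$; the grid range and the factor 0.9 are arbitrary illustrative choices.

```python
import math

def g(alpha, lam, theta):
    """g(alpha) = (1 + alpha) * exp(alpha*theta) - alpha * exp(lam) - exp(-alpha*lam)."""
    return (1 + alpha) * math.exp(alpha * theta) - alpha * math.exp(lam) - math.exp(-alpha * lam)

def min_g(lam, theta, alpha_max=5.0, steps=5000):
    """Smallest value of g on a uniform grid over [0, alpha_max]."""
    return min(g(alpha_max * i / steps, lam, theta) for i in range(steps + 1))

for lam in (0.1, 0.5, 1.0, 2.0, 3.0):
    theta_min = math.exp(lam) - lam - 1
    # First column: threshold theta (g stays non-negative); second: 10% below it.
    print(lam, min_g(lam, theta_min), min_g(lam, 0.9 * theta_min))
```

A grid check is not a proof, but it shows the threshold is sharp in these examples: below it, $g$ becomes negative for small $\alpha > 0$.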

From (2.83) and Lemma 5, it follows that $\{U_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a supermartingale if $\lambda \ge 0$ and $\theta \ge
\exp(\lambda) - \lambda - 1$. This completes the proof of Lemma 4. □

At this point, we start to discuss in parallel the derivation of the tightened bound in Theorem 8 for
conditionally symmetric martingales. As before, it is assumed without any loss of generality that $d = 1$.

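Lemma 6. Under the additional assumption of the conditional symmetry in Theorem 8, $\{U_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$
in (2.81) is a supermartingale if $\lambda \ge 0$ and $\theta \ge \cosh(\lambda) - 1$ are arbitrary constants.

Proof. By assumption $\xi_n = S_n - S_{n-1} \le 1$ a.s., and $\xi_n$ is conditionally symmetric around zero, given
$\mathcal{F}_{n-1}$, for every $n \in \mathbb{N}$. By applying Corollary 3 to the conditional expectation of $\exp(\lambda \xi_n)$ given $\mathcal{F}_{n-1}$,
for every $\lambda \ge 0$,

$$\mathbb{E}[\exp(\lambda \xi_n) \mid \mathcal{F}_{n-1}] \le 1 + \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]\, \big(\cosh(\lambda) - 1\big). \tag{2.85}$$

Hence, combining (2.82) and (2.85) gives

$$\mathbb{E}[U_n \mid \mathcal{F}_{n-1}] \le U_{n-1} \left(\frac{1 + \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]\, \big(\cosh(\lambda) - 1\big)}{\exp\big(\theta\, \mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}]\big)}\right). \tag{2.86}$$

Let $\lambda \ge 0$. Since $\mathbb{E}[\xi_n^2 \mid \mathcal{F}_{n-1}] \ge 0$ a.s., then in order to ensure that $\{U_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ forms a supermartingale,
it is sufficient (based on (2.86)) that the following condition holds:

$$\frac{1 + \alpha\big(\cosh(\lambda) - 1\big)}{\exp(\theta\alpha)} \le 1, \qquad \forall\, \alpha \ge 0. \tag{2.87}$$

Calculus shows that, for $\lambda \ge 0$, the condition in (2.87) is satisfied if and only if

$$\theta \ge \cosh(\lambda) - 1 \triangleq \theta_{\min}(\lambda). \tag{2.88}$$

From (2.86), $\{U_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a supermartingale if $\lambda \ge 0$ and $\theta \ge \theta_{\min}(\lambda)$. This proves Lemma 6. □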

Hence, due to the assumption of the conditional symmetry of the martingale in Theorem 8, the set
of parameters for which $\{U_n, \mathcal{F}_n\}$ is a supermartingale was extended. This follows from a comparison of
Lemmas 4 and 6, where indeed $\exp(\lambda) - 1 - \lambda \ge \theta_{\min}(\lambda) \ge 0$ for every $\lambda \ge 0$.

Let $z, r > 0$, $\lambda \ge 0$ and either $\theta \ge \cosh(\lambda) - 1$ or $\theta \ge \exp(\lambda) - \lambda - 1$ with or without assuming the
conditional symmetry property, respectively (see Lemmas 4 and 6). In the following, we rely on Doob's
sampling theorem. To this end, let $M \in \mathbb{N}$, and define two stopping times adapted to $\{\mathcal{F}_n\}$. The first
stopping time is $\alpha = 0$, and the second stopping time $\beta$ is the minimal value of $n \in \{0, \ldots, M\}$ (if any)
such that $S_n \ge z$ and $Q_n \le r$ (note that $S_n$ is $\mathcal{F}_n$-measurable and $Q_n$ is $\mathcal{F}_{n-1}$-measurable, so the event
$\{\beta \le n\}$ is $\mathcal{F}_n$-measurable); if such a value of $n$ does not exist, let $\beta \triangleq M$. Hence $\alpha \le \beta$ are two bounded
stopping times. From Lemma 4 or 6, $\{U_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a supermartingale for the corresponding set of
parameters $\lambda$ and $\theta$, and from Doob's sampling theorem

$$\mathbb{E}[U_\beta] \le \mathbb{E}[U_0] = 1 \tag{2.89}$$

($S_0 = Q_0 = 0$, so from (2.81), $U_0 = 1$ a.s.). Hence, it implies the following chain of inequalities:

$$\begin{aligned}
\mathbb{P}(\exists\, n \le M : S_n \ge z,\, Q_n \le r)
&\overset{(a)}{=} \mathbb{P}(S_\beta \ge z,\, Q_\beta \le r) \\
&\overset{(b)}{\le} \mathbb{P}(\lambda S_\beta - \theta Q_\beta \ge \lambda z - \theta r) \\
&\overset{(c)}{\le} \frac{\mathbb{E}[\exp(\lambda S_\beta - \theta Q_\beta)]}{\exp(\lambda z - \theta r)} \\
&\overset{(d)}{=} \frac{\mathbb{E}[U_\beta]}{\exp(\lambda z - \theta r)} \\
&\overset{(e)}{\le} \exp\big(-(\lambda z - \theta r)\big)
\end{aligned} \tag{2.90}$$

where equality (a) follows from the definition of the stopping time $\beta \in \{0, \ldots, M\}$, (b) holds since $\lambda, \theta \ge 0$,
(c) follows from Chernoff's bound, (d) follows from the definition in (2.81), and finally (e) follows from
(2.89). Since (2.90) holds for every $M \in \mathbb{N}$, then from the continuity theorem for non-decreasing events
and (2.90)

$$\begin{aligned}
\mathbb{P}(\exists\, n \in \mathbb{N} : S_n \ge z,\, Q_n \le r)
&= \lim_{M \to \infty} \mathbb{P}(\exists\, n \le M : S_n \ge z,\, Q_n \le r) \\
&\le \exp\big(-(\lambda z - \theta r)\big).
\end{aligned} \tag{2.91}$$

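The choice of the non-negative parameter $\theta$ as the minimal value for which (2.91) is valid provides the
tightest bound within this form. Hence, without assuming the conditional symmetry property for the
martingale $\{X_n, \mathcal{F}_n\}$, let (see Lemma 4) $\theta = \exp(\lambda) - \lambda - 1$. This gives that for every $z, r > 0$,

$$\mathbb{P}(\exists\, n \in \mathbb{N} : S_n \ge z,\, Q_n \le r) \le \exp\Big(-\big[\lambda z - \big(\exp(\lambda) - \lambda - 1\big) r\big]\Big), \qquad \forall\, \lambda \ge 0.$$

The minimization w.r.t. $\lambda$ gives that $\lambda = \ln\big(1 + \frac{z}{r}\big)$, and its substitution in the bound yields that

$$\mathbb{P}(\exists\, n \in \mathbb{N} : S_n \ge z,\, Q_n \le r) \le \exp\Big(-\frac{z^2}{2r} \cdot B\Big(\frac{z}{r}\Big)\Big) \tag{2.92}$$

where the function $B$ is introduced in (2.80).

Furthermore, under the assumption that the martingale $\{X_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is conditionally symmetric, let
$\theta = \theta_{\min}(\lambda)$ (see Lemma 6) for obtaining the tightest bound in (2.91) for a fixed $\lambda \ge 0$. This gives the
inequality

$$\mathbb{P}(\exists\, n \in \mathbb{N} : S_n \ge z,\, Q_n \le r) \le \exp\Big(-\big[\lambda z - r\, \theta_{\min}(\lambda)\big]\Big), \qquad \forall\, \lambda \ge 0.$$

The optimized $\lambda$ is equal to $\lambda = \sinh^{-1}\big(\frac{z}{r}\big)$. Its substitution in (2.88) gives that $\theta_{\min}(\lambda) = \sqrt{1 + \frac{z^2}{r^2}} - 1$,
and

$$\mathbb{P}(\exists\, n \in \mathbb{N} : S_n \ge z,\, Q_n \le r) \le \exp\Big(-\frac{z^2}{2r} \cdot C\Big(\frac{z}{r}\Big)\Big) \tag{2.93}$$

where the function $C$ is introduced in (2.78).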

Finally, the proof of Theorems 8 and 9 is completed by showing that the following equality holds:

$$\begin{aligned}
A &\triangleq \{\exists\, n \in \mathbb{N} : S_n \ge z,\, Q_n \le r\} \\
&= \Big\{\exists\, n \in \mathbb{N} : \max_{1 \le k \le n} S_k \ge z,\, Q_n \le r\Big\} \triangleq B.
\end{aligned} \tag{2.94}$$

Clearly $A \subseteq B$, so one needs to show that $B \subseteq A$. To this end, assume that event $B$ is satisfied.
Then, there exist some $n \in \mathbb{N}$ and $k \in \{1, \ldots, n\}$ such that $S_k \ge z$ and $Q_n \le r$. Since the predictable
quadratic variation process $\{Q_n\}_{n \in \mathbb{N}_0}$ in (2.76) is monotonic non-decreasing, it follows that $S_k \ge z$
and $Q_k \le r$; therefore, event $A$ is also satisfied and $B \subseteq A$. The combination of (2.93) and (2.94)
completes the proof of Theorem 8, and respectively the combination of (2.92) and (2.94) completes the
proof of Theorem 9. □



Freedman's inequality can be easily specialized to a concentration inequality for a sum of centered
(zero-mean) independent and bounded random variables (see Example 1). This specialization gives
Bennett's inequality [92] (see Section 2.3.1), which can be loosened to get Bernstein's inequality (as is
explained below). Furthermore, the refined inequality in Theorem 8 for conditionally symmetric martingales
with bounded jumps can be specialized (again, via Example 1) to an improved concentration
inequality for a sum of i.i.d. and bounded random variables that are symmetrically distributed around
zero. This leads to the following result:



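Corollary 5. Let $\{U_i\}_{i=1}^{n}$ be i.i.d. and bounded random variables such that $\mathbb{E}[U_1] = 0$, $\mathbb{E}[U_1^2] = \sigma^2$, and
$|U_1| \le d$ a.s. for some constant $d > 0$. Then, the following inequality holds:

$$\mathbb{P}\Big(\Big|\sum_{i=1}^{n} U_i\Big| \ge \alpha\Big) \le 2 \exp\Big(-\frac{n\sigma^2}{d^2} \cdot \phi_1\Big(\frac{\alpha d}{n\sigma^2}\Big)\Big), \qquad \forall\, \alpha > 0 \tag{2.95}$$

where $\phi_1(x) \triangleq (1 + x)\ln(1 + x) - x$ for every $x > 0$. Furthermore, if the i.i.d. and bounded random
variables $\{U_i\}_{i=1}^{n}$ have a symmetric distribution around zero, then the bound in (2.95) can be improved
to

$$\mathbb{P}\Big(\Big|\sum_{i=1}^{n} U_i\Big| \ge \alpha\Big) \le 2 \exp\Big(-\frac{n\sigma^2}{d^2} \cdot \phi_2\Big(\frac{\alpha d}{n\sigma^2}\Big)\Big), \qquad \forall\, \alpha > 0 \tag{2.96}$$

where $\phi_2(x) \triangleq x \sinh^{-1}(x) - \sqrt{1 + x^2} + 1$ for every $x > 0$.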



Proof. Inequality (2.95) follows from Freedman's inequality in Theorem 9, and inequality (2.96) follows
from the refinement of Freedman's inequality for conditionally symmetric martingales in Theorem 8.
These two theorems are applied here to the martingale sequence $\{X_k, \mathcal{F}_k\}_{k=0}^{n}$ where $X_k = \sum_{i=1}^{k} U_i$ and
$\mathcal{F}_k = \sigma(U_1, \ldots, U_k)$ for every $k \in \{1, \ldots, n\}$, and $X_0 = 0$, $\mathcal{F}_0 = \{\emptyset, \Omega\}$. The corresponding predictable
quadratic variation of the martingale up to time $n$ for this special case of a sum of i.i.d. random variables
is $Q_n = \sum_{i=1}^{n} \mathbb{E}[U_i^2] = n\sigma^2$. The result now follows by taking $z = \alpha$ and $r = n\sigma^2$ in inequalities (2.79) and (2.77)
(with the related functions that are introduced in (2.80) and (2.78), respectively). Note that the same
bound holds for the two one-sided tail inequalities, giving the factor 2 on the right-hand sides of (2.95)
and (2.96). □



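Remark 11. Bennett's concentration inequality in (2.95) can be loosened to obtain Bernstein's inequality.
To this end, the following lower bound on $\phi_1$ is used:

$$\phi_1(x) \ge \frac{x^2}{2 + \frac{2x}{3}}, \qquad \forall\, x \ge 0.$$

This gives the inequality

$$\mathbb{P}\Big(\Big|\sum_{i=1}^{n} U_i\Big| \ge \alpha\Big) \le 2 \exp\Big(-\frac{\alpha^2}{2n\sigma^2 + \frac{2\alpha d}{3}}\Big), \qquad \forall\, \alpha > 0.$$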
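The three bounds for sums of i.i.d. bounded variables can be compared numerically. The sketch below assumes $\phi_1(x) = (1+x)\ln(1+x) - x$ and $\phi_2(x) = x\sinh^{-1}(x) - \sqrt{1+x^2} + 1$, Bennett-type bounds of the form $2\exp\big(-\frac{n\sigma^2}{d^2}\phi\big(\frac{\alpha d}{n\sigma^2}\big)\big)$, and the Bernstein bound $2\exp\big(-\alpha^2/(2n\sigma^2 + 2\alpha d/3)\big)$. The function names and the sample parameters are illustrative.

```python
import math

def phi1(x):
    """phi_1(x) = (1 + x) ln(1 + x) - x."""
    return (1 + x) * math.log1p(x) - x

def phi2(x):
    """phi_2(x) = x * asinh(x) - sqrt(1 + x^2) + 1."""
    return x * math.asinh(x) - math.sqrt(1 + x * x) + 1

def bennett_bound(alpha, n, sigma2, d):
    """Bennett's bound for a sum of n centered i.i.d. variables bounded by d."""
    return 2 * math.exp(-(n * sigma2 / d**2) * phi1(alpha * d / (n * sigma2)))

def refined_bound(alpha, n, sigma2, d):
    """Refined bound under the extra assumption of a symmetric distribution."""
    return 2 * math.exp(-(n * sigma2 / d**2) * phi2(alpha * d / (n * sigma2)))

def bernstein_bound(alpha, n, sigma2, d):
    """Bernstein's bound, obtained by loosening Bennett's bound."""
    return 2 * math.exp(-alpha**2 / (2 * n * sigma2 + 2 * alpha * d / 3))

n, sigma2, d = 100, 0.5, 1.0
for alpha in (5.0, 10.0, 30.0):
    print(alpha, refined_bound(alpha, n, sigma2, d),
          bennett_bound(alpha, n, sigma2, d), bernstein_bound(alpha, n, sigma2, d))
```

In each row the values should be ordered as refined, Bennett, Bernstein, from tightest to loosest.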






2.5 Relations of the refined inequalities to some classical results in 
probability theory 

2.5.1 Link between the martingale central limit theorem (CLT) and Proposition 1

In this subsection, we discuss the relation between the martingale CLT and the concentration inequalities
for discrete-parameter martingales in Proposition 1.

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space. Given a filtration $\{\mathcal{F}_k\}$, then $\{Y_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is said to be a
martingale-difference sequence if, for every $k$,

1. $Y_k$ is $\mathcal{F}_k$-measurable,

2. $\mathbb{E}[|Y_k|] < \infty$,

3. $\mathbb{E}[Y_k \mid \mathcal{F}_{k-1}] = 0$.

Let

$$S_n \triangleq \sum_{k=1}^{n} Y_k, \qquad \forall\, n \in \mathbb{N}$$

and $S_0 \triangleq 0$; then $\{S_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale. Assume that the sequence of RVs $\{Y_k\}$ is bounded, i.e.,
there exists a constant $d$ such that $|Y_k| \le d$ a.s., and furthermore, assume that the limit

$$\sigma^2 \triangleq \lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \mathbb{E}[Y_k^2 \mid \mathcal{F}_{k-1}]$$

exists in probability and is positive. The martingale CLT asserts that, under the above conditions,
$\frac{S_n}{\sqrt{n}}$ converges in distribution (i.e., weakly converges) to the Gaussian distribution $\mathcal{N}(0, \sigma^2)$. It is denoted
by $\frac{S_n}{\sqrt{n}} \Rightarrow \mathcal{N}(0, \sigma^2)$. We note that there exist more general versions of this statement (see, e.g., [93,
pp. 475-478]).

Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter real-valued martingale with bounded jumps, and assume
that there exists a constant $d$ so that a.s.

$$|X_k - X_{k-1}| \le d, \qquad \forall\, k \in \mathbb{N}.$$

Define, for every $k \in \mathbb{N}$,

$$Y_k \triangleq X_k - X_{k-1}$$

and $Y_0 \triangleq 0$, so $\{Y_k, \mathcal{F}_k\}_{k=0}^{\infty}$ is a martingale-difference sequence, and $|Y_k| \le d$ a.s. for every $k \in \mathbb{N} \cup \{0\}$.
Furthermore, for every $n \in \mathbb{N}$,

$$S_n \triangleq \sum_{k=1}^{n} Y_k = X_n - X_0.$$

Under the assumptions of Theorem 5 and its subsequent results, for every $k \in \mathbb{N}$, one gets a.s. that

$$\mathbb{E}[Y_k^2 \mid \mathcal{F}_{k-1}] = \mathbb{E}[(X_k - X_{k-1})^2 \mid \mathcal{F}_{k-1}] \le \sigma^2.$$

Let us assume that this inequality holds a.s. with equality. It follows from the martingale CLT that

$$\frac{X_n - X_0}{\sqrt{n}} \Rightarrow \mathcal{N}(0, \sigma^2)$$

and therefore, for every $\alpha \ge 0$,

$$\lim_{n \to \infty} \mathbb{P}\big(|X_n - X_0| \ge \alpha\sqrt{n}\big) = 2\, Q\Big(\frac{\alpha}{\sigma}\Big)$$

where the $Q$ function is introduced in (2.11).

Based on the notation in (2.35), the equality $\frac{\alpha}{\sigma} = \frac{\delta}{\sqrt{\gamma}}$ holds, and

$$\lim_{n \to \infty} \mathbb{P}\big(|X_n - X_0| \ge \alpha\sqrt{n}\big) = 2\, Q\Big(\frac{\delta}{\sqrt{\gamma}}\Big). \tag{2.97}$$

Since, for every $x \ge 0$,

$$Q(x) \le \frac{1}{2} \exp\Big(-\frac{x^2}{2}\Big)$$

then it follows that for every $\alpha \ge 0$

$$\lim_{n \to \infty} \mathbb{P}\big(|X_n - X_0| \ge \alpha\sqrt{n}\big) \le \exp\Big(-\frac{\delta^2}{2\gamma}\Big).$$

This inequality coincides with the asymptotic result of the inequalities in Proposition 1 (see (2.74) in the
limit where $n \to \infty$), except for the additional factor of 2. Note also that the proof of the concentration
inequalities in Proposition 1 (see Appendix 2.A) provides inequalities that are informative for finite $n$,
and not only in the asymptotic case where $n$ tends to infinity. Furthermore, due to the exponential
upper and lower bounds of the Q-function in (2.12), it follows from (2.97) that the exponent in the
concentration inequality (2.74) (i.e., $\frac{\delta^2}{2\gamma}$) cannot be improved under the above assumptions (unless some
more information is available).
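The comparison between the CLT limit and the exponential bound can be made concrete. The sketch below computes $2Q(\delta/\sqrt{\gamma})$ through the standard identity $Q(x) = \frac{1}{2}\mathrm{erfc}(x/\sqrt{2})$ and prints it next to $\exp(-\delta^2/(2\gamma))$; the function names and the sample values of $\delta$ and $\gamma$ are illustrative.

```python
import math

def q_function(x):
    """Gaussian tail Q(x) = P(N(0,1) >= x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def clt_limit(delta, gamma):
    """Limit 2 * Q(delta / sqrt(gamma)) given by the martingale CLT."""
    return 2 * q_function(delta / math.sqrt(gamma))

def exponential_bound(delta, gamma):
    """Upper bound exp(-delta^2 / (2 * gamma)), from Q(x) <= exp(-x^2 / 2) / 2."""
    return math.exp(-delta**2 / (2 * gamma))

for delta, gamma in [(0.5, 1.0), (1.0, 0.5), (2.0, 0.25)]:
    print(delta, gamma, clt_limit(delta, gamma), exponential_bound(delta, gamma))
```

The two columns share the same exponential decay rate in $\delta^2/(2\gamma)$ and differ only by a sub-exponential factor.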



2.5.2 Relation between the law of the iterated logarithm (LIL) and Theorem 5

In this subsection, we discuss the relation between the law of the iterated logarithm (LIL) and Theorem 5.
According to the law of the iterated logarithm (see, e.g., [93, Theorem 9.5]), if $\{X_k\}_{k=1}^{\infty}$ are i.i.d.
real-valued RVs with zero mean and unit variance, and $S_n \triangleq \sum_{i=1}^{n} X_i$ for every $n \in \mathbb{N}$, then

$$\limsup_{n \to \infty} \frac{S_n}{\sqrt{2n \ln\ln n}} = 1 \quad \text{a.s.} \tag{2.98}$$

and

$$\liminf_{n \to \infty} \frac{S_n}{\sqrt{2n \ln\ln n}} = -1 \quad \text{a.s.} \tag{2.99}$$

Eqs. (2.98) and (2.99) assert, respectively, that for every $\varepsilon > 0$, along almost any realization,

$$S_n > (1 - \varepsilon)\sqrt{2n \ln\ln n}$$

and

$$S_n < -(1 - \varepsilon)\sqrt{2n \ln\ln n}$$

are satisfied infinitely often (i.o.). On the other hand, Eqs. (2.98) and (2.99) imply that along almost any
realization, each of the two inequalities

$$S_n > (1 + \varepsilon)\sqrt{2n \ln\ln n}$$

and

$$S_n < -(1 + \varepsilon)\sqrt{2n \ln\ln n}$$

is satisfied for a finite number of values of $n$.

Let $\{X_k\}_{k=1}^{\infty}$ be i.i.d. real-valued RVs, defined over the probability space $(\Omega, \mathcal{F}, \mathbb{P})$, with $\mathbb{E}[X_1] = 0$
and $\mathbb{E}[X_1^2] = 1$.
Let us define the natural filtration where $\mathcal{F}_0 = \{\emptyset, \Omega\}$, and $\mathcal{F}_k = \sigma(X_1, \ldots, X_k)$ is the $\sigma$-algebra that
is generated by the RVs $X_1, \ldots, X_k$ for every $k \in \mathbb{N}$. Let $S_0 = 0$ and $S_n$ be defined as above for every
$n \in \mathbb{N}$. It is straightforward to verify by Definition 1 that $\{S_n, \mathcal{F}_n\}_{n=0}^{\infty}$ is a martingale.

In order to apply Theorem 5 to the considered case, let us assume that the RVs $\{X_k\}_{k=1}^{\infty}$ are uniformly
bounded, i.e., it is assumed that there exists a constant $c$ such that $|X_k| \le c$ a.s. for every $k \in \mathbb{N}$. Since
$\mathbb{E}[X_1^2] = 1$ then $c \ge 1$. This assumption implies that the martingale $\{S_n, \mathcal{F}_n\}_{n=0}^{\infty}$ has bounded jumps,
and for every $n \in \mathbb{N}$

$$|S_n - S_{n-1}| \le c \quad \text{a.s.}$$

Moreover, due to the independence of the RVs $\{X_k\}_{k=1}^{\infty}$, then

$$\mathrm{Var}(S_n \mid \mathcal{F}_{n-1}) = \mathbb{E}(X_n^2 \mid \mathcal{F}_{n-1}) = \mathbb{E}(X_n^2) = 1 \quad \text{a.s.}$$

From Theorem 5, it follows that for every $\alpha \ge 0$



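$$\mathbb{P}\big(S_n \ge \alpha\sqrt{2n \ln\ln n}\big) \le \exp\Big(-n\, D\Big(\frac{\delta_n + \gamma}{1 + \gamma} \,\Big\|\, \frac{\gamma}{1 + \gamma}\Big)\Big) \tag{2.100}$$

where

$$\delta_n \triangleq \frac{\alpha}{c} \sqrt{\frac{2 \ln\ln n}{n}}, \qquad \gamma \triangleq \frac{1}{c^2}. \tag{2.101}$$

Straightforward calculation shows that

$$\begin{aligned}
n\, D\Big(\frac{\delta_n + \gamma}{1 + \gamma} \,\Big\|\, \frac{\gamma}{1 + \gamma}\Big)
&= \frac{n\gamma}{1 + \gamma} \Big[\Big(1 + \frac{\delta_n}{\gamma}\Big) \ln\Big(1 + \frac{\delta_n}{\gamma}\Big) + \frac{1}{\gamma} (1 - \delta_n) \ln(1 - \delta_n)\Big] \\
&\overset{(a)}{=} \frac{n\gamma}{1 + \gamma} \Big[\frac{\delta_n^2}{2} \Big(\frac{1}{\gamma^2} + \frac{1}{\gamma}\Big) + \frac{\delta_n^3}{6} \Big(\frac{1}{\gamma} - \frac{1}{\gamma^3}\Big) + \ldots\Big] \\
&= \frac{n \delta_n^2}{2\gamma} - \frac{n \delta_n^3 (1 - \gamma)}{6 \gamma^2} + \ldots \\
&\overset{(b)}{=} \alpha^2 \ln\ln n \Big[1 - \frac{\alpha (c^2 - 1)}{3c} \sqrt{\frac{2 \ln\ln n}{n}} + \ldots\Big]
\end{aligned} \tag{2.102}$$

where equality (a) follows from the power series expansion

$$(1 + u) \ln(1 + u) = u + \sum_{k=2}^{\infty} \frac{(-u)^k}{k(k-1)}, \qquad -1 < u \le 1$$

and equality (b) follows from (2.101). A substitution of (2.102) into (2.100) gives that, for every $\alpha \ge 0$,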



Sn > aV2n In Inn] < (inn) 



(2.103) 



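and the same bound also applies to $\mathbb{P}\bigl(S_n \leq -\alpha\sqrt{2n \ln\ln n}\bigr)$ for $\alpha \geq 0$. This provides complementary information to the limits in (2.98) and (2.99) that are provided by the LIL. From Remark 6, which follows from Doob's maximal inequality for sub-martingales, the inequality in (2.103) can be strengthened to

$$\mathbb{P}\Bigl(\max_{1 \leq k \leq n} S_k \geq \alpha\sqrt{2n \ln\ln n}\Bigr) \leq (\ln n)^{-\alpha^2\left[1+O\left(\sqrt{\frac{\ln\ln n}{n}}\right)\right]} \tag{2.104}$$

It is shown in the following that (2.104) and the first Borel-Cantelli lemma can serve to prove one part of (2.98). Using this approach, it is shown that if $\alpha > 1$, then the probability that $S_n > \alpha\sqrt{2n \ln\ln n}$ i.o. is zero. To this end, let $\theta > 1$ be set arbitrarily, and define

$$A_n = \bigcup_{k:\; \theta^{n-1} \leq k \leq \theta^n} \Bigl\{S_k \geq \alpha\sqrt{2k \ln\ln k}\Bigr\}$$

for every $n \in \mathbb{N}$. Hence, the union of these sets is

$$A \triangleq \bigcup_{n \in \mathbb{N}} A_n = \bigcup_{k \in \mathbb{N}} \Bigl\{S_k \geq \alpha\sqrt{2k \ln\ln k}\Bigr\}.$$

The following inequalities hold (since $\theta > 1$):

$$\begin{aligned}
\mathbb{P}(A_n) &\leq \mathbb{P}\Bigl(\max_{\theta^{n-1} \leq k \leq \theta^n} S_k \geq \alpha\sqrt{2\theta^{n-1}\ln\ln(\theta^{n-1})}\Bigr) \\
&= \mathbb{P}\Bigl(\max_{\theta^{n-1} \leq k \leq \theta^n} S_k \geq \frac{\alpha}{\sqrt{\theta}}\sqrt{2\theta^{n}\ln\ln(\theta^{n-1})}\Bigr) \\
&\leq \mathbb{P}\Bigl(\max_{1 \leq k \leq \theta^n} S_k \geq \frac{\alpha}{\sqrt{\theta}}\sqrt{2\theta^{n}\ln\ln(\theta^{n-1})}\Bigr) \\
&\leq (n \ln\theta)^{-\frac{\alpha^2}{\theta}(1+\beta_n)}
\end{aligned} \tag{2.105}$$

where the last inequality follows from (2.104) with $\beta_n \to 0$ as $n \to \infty$. Since

$$\sum_{n=1}^{\infty} n^{-\alpha^2/\theta} < \infty, \qquad \forall\, \alpha > \sqrt{\theta}$$

then it follows from the first Borel-Cantelli lemma that $\mathbb{P}(A \text{ i.o.}) = 0$ for all $\alpha > \sqrt{\theta}$. But the event $A$ does not depend on $\theta$, and $\theta > 1$ can be made arbitrarily close to 1. This asserts that $\mathbb{P}(A \text{ i.o.}) = 0$ for every $\alpha > 1$, or equivalently

$$\limsup_{n \to \infty} \frac{S_n}{\sqrt{2n \ln\ln n}} \leq 1 \quad \text{a.s.}$$

Similarly, by replacing $\{X_i\}$ with $\{-X_i\}$, it follows that

$$\liminf_{n \to \infty} \frac{S_n}{\sqrt{2n \ln\ln n}} \geq -1 \quad \text{a.s.}$$

Theorem 5 therefore gives inequality (2.104), and it implies one side in each of the two equalities for the LIL in (2.98) and (2.99).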
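The leading-order behavior of the exponent in (2.100) can be checked numerically. The following is a minimal sketch, assuming the definitions of $\delta_n$ and $\gamma$ in (2.101); the function names and the values of `alpha` and `c` are illustrative choices and are not part of the original analysis. It evaluates the ratio between $n D(\cdot\|\cdot)$ and $\alpha^2 \ln\ln n$, which should approach 1 from below as $n$ grows.

```python
import math

def kl_bernoulli(p, q):
    """Binary relative entropy D(p||q) in nats, for 0 <= p <= 1 and 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def lil_exponent(n, alpha, c):
    """Exponent n * D((delta_n + gamma)/(1 + gamma) || gamma/(1 + gamma))."""
    gamma = 1.0 / c ** 2
    delta_n = (alpha / c) * math.sqrt(2.0 * math.log(math.log(n)) / n)
    return n * kl_bernoulli((delta_n + gamma) / (1.0 + gamma), gamma / (1.0 + gamma))

# Illustrative (hypothetical) parameters: deviation factor alpha and bound c on |X_k|.
alpha, c = 1.5, 2.0
ratios = []
for n in (10 ** 3, 10 ** 5, 10 ** 7, 10 ** 9):
    ratio = lil_exponent(n, alpha, c) / (alpha ** 2 * math.log(math.log(n)))
    ratios.append(ratio)
    print(n, ratio)
```

The slow approach of the ratio to 1 reflects the correction term of order $\sqrt{\ln\ln n / n}$ in the expansion.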

2.5.3 Relation of Theorem [5] with the moderate deviations principle 

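According to the moderate deviations theorem (see, e.g., [84, Theorem 3.7.1]) in $\mathbb{R}$, let $\{X_i\}_{i=1}^{n}$ be a sequence of real-valued i.i.d. RVs such that $\Lambda_X(\lambda) = \mathbb{E}[e^{\lambda X_i}] < \infty$ in some neighborhood of zero, and also assume that $\mathbb{E}[X_i] = 0$ and $\sigma^2 = \text{Var}(X_i) > 0$. Let $\{a_n\}_{n=1}^{\infty}$ be a non-negative sequence such that $a_n \to 0$ and $n a_n \to \infty$ as $n \to \infty$, and let

$$Z_n \triangleq \sqrt{\frac{a_n}{n}} \sum_{i=1}^{n} X_i, \qquad \forall\, n \in \mathbb{N}. \tag{2.106}$$

Then, for every measurable set $\Gamma \subseteq \mathbb{R}$,

$$\begin{aligned}
-\frac{1}{2\sigma^2} \inf_{x \in \Gamma^{\circ}} x^2
&\leq \liminf_{n \to \infty} a_n \ln \mathbb{P}(Z_n \in \Gamma) \\
&\leq \limsup_{n \to \infty} a_n \ln \mathbb{P}(Z_n \in \Gamma) \\
&\leq -\frac{1}{2\sigma^2} \inf_{x \in \overline{\Gamma}} x^2
\end{aligned} \tag{2.107}$$

where $\Gamma^{\circ}$ and $\overline{\Gamma}$ designate, respectively, the interior and closure sets of $\Gamma$.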

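Let $\eta \in (\tfrac{1}{2}, 1)$ be an arbitrary fixed number, and let $\{a_n\}_{n=1}^{\infty}$ be the non-negative sequence

$$a_n = n^{1-2\eta}, \qquad \forall\, n \in \mathbb{N}$$

so that $a_n \to 0$ and $n a_n \to \infty$ as $n \to \infty$. Let $\alpha \in \mathbb{R}^{+}$, and $\Gamma \triangleq (-\infty, -\alpha] \cup [\alpha, \infty)$. Note that, from (2.106),

$$\mathbb{P}\left(\Bigl|\sum_{i=1}^{n} X_i\Bigr| \geq \alpha n^{\eta}\right) = \mathbb{P}(Z_n \in \Gamma)$$

so from the moderate deviations principle (MDP), for every $\alpha \geq 0$,

$$\lim_{n \to \infty} n^{1-2\eta} \ln \mathbb{P}\left(\Bigl|\sum_{i=1}^{n} X_i\Bigr| \geq \alpha n^{\eta}\right) = -\frac{\alpha^2}{2\sigma^2}. \tag{2.108}$$

It is demonstrated in Appendix 2.B that, in contrast to Azuma's inequality, Theorem 5 provides an upper bound on the probability

$$\mathbb{P}\left(\Bigl|\sum_{i=1}^{n} X_i\Bigr| \geq \alpha n^{\eta}\right), \qquad \forall\, n \in \mathbb{N},\; \alpha \geq 0$$

which coincides with the asymptotic limit in (2.108). The analysis in Appendix 2.B provides another interesting link between Theorem 5 and a classical result in probability theory, which also emphasizes the significance of the refinements of Azuma's inequality.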



2.5.4 Relation of the concentration inequalities for martingales to discrete-time 
Markov chains 

A striking well-known relation between discrete-time Markov chains and martingales is the following (see, e.g., [94, p. 473]): Let $\{X_n\}_{n \in \mathbb{N}_0}$ ($\mathbb{N}_0 \triangleq \mathbb{N} \cup \{0\}$) be a discrete-time Markov chain taking values in a countable state space $\mathcal{S}$ with transition matrix $\mathbf{P}$, and let the function $\psi : \mathcal{S} \to \mathbb{R}$ be harmonic (i.e., $\sum_{j \in \mathcal{S}} p_{i,j}\, \psi(j) = \psi(i)$ for every $i \in \mathcal{S}$), and assume that $\mathbb{E}[|\psi(X_n)|] < \infty$ for every $n$. Then, $\{Y_n, \mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is a martingale where $Y_n \triangleq \psi(X_n)$ and $\{\mathcal{F}_n\}_{n \in \mathbb{N}_0}$ is the natural filtration. This relation, which follows directly from the Markov property, enables one to apply the concentration inequalities in Section 2.3 for harmonic functions of Markov chains when the function $\psi$ is bounded (so that the jumps of the martingale sequence are uniformly bounded).
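The harmonicity condition $\sum_{j} p_{i,j}\psi(j) = \psi(i)$ is easy to test for a finite chain. The following is a minimal sketch; the chain (a symmetric random walk on $\{0, \ldots, 4\}$ absorbed at both ends) and the helper name `is_harmonic` are hypothetical choices made only for illustration.

```python
def is_harmonic(P, psi, tol=1e-12):
    """Check that sum_j P[i][j] * psi[j] equals psi[i] for every state i."""
    n = len(P)
    return all(
        abs(sum(P[i][j] * psi[j] for j in range(n)) - psi[i]) <= tol
        for i in range(n)
    )

# Hypothetical example: symmetric random walk on {0,...,4}, absorbed at 0 and 4.
N = 4
P = [[0.0] * (N + 1) for _ in range(N + 1)]
P[0][0] = P[N][N] = 1.0
for i in range(1, N):
    P[i][i - 1] = P[i][i + 1] = 0.5

psi_linear = [float(i) for i in range(N + 1)]      # psi(i) = i
psi_square = [float(i * i) for i in range(N + 1)]  # psi(i) = i^2

print(is_harmonic(P, psi_linear))
print(is_harmonic(P, psi_square))
```

For this chain, $\psi(i) = i$ is bounded and harmonic, so $\psi(X_n)$ is a martingale with jumps bounded by 1, whereas $\psi(i) = i^2$ fails the condition at the interior states.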

Exponential deviation bounds for an important class of Markov chains, called Doeblin chains (they are characterized by an exponentially fast convergence to the equilibrium, uniformly in the initial condition), were derived in [95]. These bounds were also shown to be essentially identical to the Hoeffding inequality in the special case of i.i.d. RVs (see [95, Remark 1]).

2.6 Applications in information theory and related topics
2.6.1 Binary hypothesis testing 

Binary hypothesis testing for finite alphabet models was analyzed via the method of types, e.g., in [96, Chapter 11] and [97]. It is assumed that the data sequence is of a fixed length ($n$), and one wishes to make the optimal decision based on the received sequence and the Neyman-Pearson ratio test.

Let the RVs $X_1, X_2, \ldots$ be i.i.d. $\sim Q$, and consider two hypotheses:

$H_1 : Q = P_1$.
$H_2 : Q = P_2$.

For the simplicity of the analysis, let us assume that the RVs are discrete, and take their values on a finite alphabet $\mathcal{X}$ where $P_1(x), P_2(x) > 0$ for every $x \in \mathcal{X}$.

In the following, let

$$L(X_1, \ldots, X_n) \triangleq \ln \frac{P_1^n(X_1, \ldots, X_n)}{P_2^n(X_1, \ldots, X_n)} = \sum_{i=1}^{n} \ln \frac{P_1(X_i)}{P_2(X_i)}$$

designate the log-likelihood ratio. By the strong law of large numbers (SLLN), if hypothesis $H_1$ is true, then a.s.

$$\lim_{n \to \infty} \frac{L(X_1, \ldots, X_n)}{n} = D(P_1 \| P_2) \tag{2.109}$$

and otherwise, if hypothesis $H_2$ is true, then a.s.

$$\lim_{n \to \infty} \frac{L(X_1, \ldots, X_n)}{n} = -D(P_2 \| P_1) \tag{2.110}$$

where the above assumptions on the probability mass functions $P_1$ and $P_2$ imply that the relative entropies, $D(P_1 \| P_2)$ and $D(P_2 \| P_1)$, are both finite. Consider the case where for some fixed constants $\underline{\lambda}, \overline{\lambda} \in \mathbb{R}$ that satisfy

$$-D(P_2 \| P_1) < \underline{\lambda} \leq \overline{\lambda} < D(P_1 \| P_2)$$

one decides on hypothesis $H_1$ if

$$L(X_1, \ldots, X_n) > n \overline{\lambda}$$

and on hypothesis $H_2$ if

$$L(X_1, \ldots, X_n) < n \underline{\lambda}.$$

Note that if $\underline{\lambda} = \overline{\lambda} \triangleq \lambda$ then a decision on the two hypotheses is based on comparing the normalized log-likelihood ratio (w.r.t. $n$) to a single threshold ($\lambda$), and deciding on hypothesis $H_1$ or $H_2$ if it is, respectively, above or below $\lambda$. If $\underline{\lambda} < \overline{\lambda}$ then one decides on $H_1$ or $H_2$ if the normalized log-likelihood ratio is, respectively, above the upper threshold $\overline{\lambda}$ or below the lower threshold $\underline{\lambda}$. Otherwise, if the normalized log-likelihood ratio is between the upper and lower thresholds, then an erasure is declared and no decision is taken in this case.
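The two-threshold decision rule with an erasure option can be sketched in a few lines. This is only an illustration under stated assumptions: the names `llr` and `decide`, the binary pair of distributions, and the threshold values $\pm 0.05$ are hypothetical and were chosen so that the thresholds lie strictly between $-D(P_2\|P_1)$ and $D(P_1\|P_2)$.

```python
import math

def llr(samples, p1, p2):
    """Log-likelihood ratio L(x_1,...,x_n) = sum_i ln(P1(x_i) / P2(x_i))."""
    return sum(math.log(p1[x] / p2[x]) for x in samples)

def decide(samples, p1, p2, lam_low, lam_high):
    """Return 'H1', 'H2' or 'erasure' by comparing L with n*lam_high and n*lam_low."""
    n = len(samples)
    value = llr(samples, p1, p2)
    if value > n * lam_high:
        return "H1"
    if value < n * lam_low:
        return "H2"
    return "erasure"

# Hypothetical distributions on the alphabet {0, 1} and hypothetical thresholds.
p1 = {0: 0.4, 1: 0.6}
p2 = {0: 0.6, 1: 0.4}
lam_low, lam_high = -0.05, 0.05

print(decide([1, 1, 1, 0], p1, p2, lam_low, lam_high))
print(decide([0, 0, 0, 1], p1, p2, lam_low, lam_high))
print(decide([0, 1, 0, 1], p1, p2, lam_low, lam_high))
```

Setting `lam_low == lam_high` recovers the single-threshold test, in which an erasure can only occur when the log-likelihood ratio equals the threshold exactly.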
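Let

$$\alpha_n^{(1)} \triangleq P_1^n\bigl(L(X_1, \ldots, X_n) \leq n \overline{\lambda}\bigr) \tag{2.111}$$

$$\alpha_n^{(2)} \triangleq P_1^n\bigl(L(X_1, \ldots, X_n) \leq n \underline{\lambda}\bigr) \tag{2.112}$$

and

$$\beta_n^{(1)} \triangleq P_2^n\bigl(L(X_1, \ldots, X_n) \geq n \underline{\lambda}\bigr) \tag{2.113}$$

$$\beta_n^{(2)} \triangleq P_2^n\bigl(L(X_1, \ldots, X_n) \geq n \overline{\lambda}\bigr) \tag{2.114}$$

then $\alpha_n^{(1)}$ and $\beta_n^{(1)}$ are the probabilities of either making an error or declaring an erasure under, respectively, hypotheses $H_1$ and $H_2$; similarly, $\alpha_n^{(2)}$ and $\beta_n^{(2)}$ are the probabilities of making an error under hypotheses $H_1$ and $H_2$, respectively.

Let $\pi_1, \pi_2 \in (0, 1)$ denote the a-priori probabilities of the hypotheses $H_1$ and $H_2$, respectively, so

$$P_{e,n}^{(1)} = \pi_1 \alpha_n^{(1)} + \pi_2 \beta_n^{(1)} \tag{2.115}$$

is the probability of having either an error or an erasure, and

$$P_{e,n}^{(2)} = \pi_1 \alpha_n^{(2)} + \pi_2 \beta_n^{(2)} \tag{2.116}$$

is the probability of error.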
Exact Exponents

When we let $n$ tend to infinity, the exact exponents of $\alpha_n^{(j)}$ and $\beta_n^{(j)}$ ($j = 1, 2$) are derived via Cramér's theorem. The resulting exponents form a straightforward generalization of, e.g., [84, Theorem 3.4.3] and [98, Theorem 6.4] that addresses the case where the decision is made based on a single threshold of the log-likelihood ratio. In this particular case where $\underline{\lambda} = \overline{\lambda} \triangleq \lambda$, the option of erasures does not exist, and $P_{e,n}^{(1)} = P_{e,n}^{(2)} \triangleq P_{e,n}$ is the error probability.

In the considered general case with erasures, let

$$\lambda_1 \triangleq -\overline{\lambda}, \qquad \lambda_2 \triangleq -\underline{\lambda}$$

then Cramér's theorem on $\mathbb{R}$ yields that the exact exponents of $\alpha_n^{(1)}$, $\alpha_n^{(2)}$, $\beta_n^{(1)}$ and $\beta_n^{(2)}$ are given by

$$\lim_{n \to \infty} -\frac{\ln \alpha_n^{(1)}}{n} = I(\lambda_1) \tag{2.117}$$

$$\lim_{n \to \infty} -\frac{\ln \alpha_n^{(2)}}{n} = I(\lambda_2) \tag{2.118}$$

$$\lim_{n \to \infty} -\frac{\ln \beta_n^{(1)}}{n} = I(\lambda_2) - \lambda_2 \tag{2.119}$$

$$\lim_{n \to \infty} -\frac{\ln \beta_n^{(2)}}{n} = I(\lambda_1) - \lambda_1 \tag{2.120}$$

where the rate function $I$ is given by

$$I(r) \triangleq \sup_{t \in \mathbb{R}} \bigl(tr - H(t)\bigr) \tag{2.121}$$

and

$$H(t) = \ln\left(\sum_{x \in \mathcal{X}} P_1(x)^{1-t} P_2(x)^{t}\right), \qquad \forall\, t \in \mathbb{R}. \tag{2.122}$$

The rate function $I$ is convex, lower semi-continuous (l.s.c.) and non-negative (see, e.g., [84] and [98]). Note that

$$H(t) = (t-1)\, D_t(P_2 \| P_1)$$

where $D_t(P \| Q)$ designates Rényi's information divergence of order $t$ [99, Eq. (3.3)], and $I$ in (2.121) is the Fenchel-Legendre transform of $H$ (see, e.g., [84, Definition 2.2.2]).

From (2.115)-(2.120), the exact exponents of $P_{e,n}^{(1)}$ and $P_{e,n}^{(2)}$ are equal to

$$\lim_{n \to \infty} -\frac{\ln P_{e,n}^{(1)}}{n} = \min\bigl\{I(\lambda_1),\, I(\lambda_2) - \lambda_2\bigr\} \tag{2.123}$$

and

$$\lim_{n \to \infty} -\frac{\ln P_{e,n}^{(2)}}{n} = \min\bigl\{I(\lambda_2),\, I(\lambda_1) - \lambda_1\bigr\}. \tag{2.124}$$



For the case where the decision is based on a single threshold for the log-likelihood ratio (i.e., $\lambda_1 = \lambda_2 \triangleq \lambda$), then $P_{e,n}^{(1)} = P_{e,n}^{(2)} \triangleq P_{e,n}$, and its error exponent is equal to

$$\lim_{n \to \infty} -\frac{\ln P_{e,n}}{n} = \min\bigl\{I(\lambda),\, I(\lambda) - \lambda\bigr\} \tag{2.125}$$

which coincides with the error exponent in [84, Theorem 3.4.3] (or [98, Theorem 6.4]). The optimal threshold for obtaining the best error exponent of the error probability $P_{e,n}$ is equal to zero (i.e., $\lambda = 0$); in this case, the exact error exponent is equal to

$$I(0) = -\min_{0 \leq t \leq 1} \ln\left(\sum_{x \in \mathcal{X}} P_1(x)^{1-t} P_2(x)^{t}\right) \triangleq C(P_1, P_2) \tag{2.126}$$

which is the Chernoff information of the probability measures $P_1$ and $P_2$ (see [96, Eq. (11.239)]), and it is symmetric (i.e., $C(P_1, P_2) = C(P_2, P_1)$). Note that, from (2.121), $I(0) = \sup_{t \in \mathbb{R}} \bigl(-H(t)\bigr) = -\inf_{t \in \mathbb{R}} H(t)$; the minimization in (2.126) over the interval $[0, 1]$ (instead of taking the infimum of $H$ over $\mathbb{R}$) is due to the fact that $H(0) = H(1) = 0$ and the function $H$ in (2.122) is convex, so it is enough to restrict the infimum of $H$ to the closed interval $[0, 1]$ for which it turns to be a minimum.
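Since $H$ is convex with $H(0) = H(1) = 0$, the Chernoff information in (2.126) can be computed by a one-dimensional search over $[0, 1]$. The sketch below uses a ternary search, which is one possible choice and is not taken from the original text; the function names are illustrative, and the binary pair of distributions is the one that appears in Example 11.

```python
import math

def log_mgf(t, p1, p2):
    """H(t) = ln sum_x P1(x)^(1-t) * P2(x)^t, as in (2.122)."""
    return math.log(sum(a ** (1.0 - t) * b ** t for a, b in zip(p1, p2)))

def chernoff_information(p1, p2, iters=200):
    """C(P1, P2) = -min_{0<=t<=1} H(t); ternary search relies on the convexity of H."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_mgf(m1, p1, p2) < log_mgf(m2, p1, p2):
            hi = m2
        else:
            lo = m1
    return -log_mgf((lo + hi) / 2.0, p1, p2)

p1 = [0.4, 0.6]
p2 = [0.6, 0.4]
c12 = chernoff_information(p1, p2)
c21 = chernoff_information(p2, p1)
print(c12, c21)
```

For this pair the minimum is attained at $t = \tfrac{1}{2}$, and the computed value is approximately $2.04 \cdot 10^{-2}$ in both orders of the arguments, in agreement with the symmetry of the Chernoff information.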

Lower Bound on the Exponents via Theorem [5] 

In the following, the tightness of Theorem [5] is examined by using it for the derivation of lower bounds on 
the error exponent and the exponent of the event of having either an error or an erasure. These results 
will be compared in the next subsection to the exact exponents from the previous subsection. 

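We first derive a lower bound on the exponent of $\alpha_n^{(1)}$. Under hypothesis $H_1$, let us construct the martingale sequence $\{U_k, \mathcal{F}_k\}_{k=0}^{n}$ where $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n$ is the filtration

$$\mathcal{F}_0 = \{\emptyset, \Omega\}, \qquad \mathcal{F}_k = \sigma(X_1, \ldots, X_k), \quad \forall\, k \in \{1, \ldots, n\}$$

and

$$U_k = \mathbb{E}_{P_1^n}\bigl[L(X_1, \ldots, X_n) \,\big|\, \mathcal{F}_k\bigr]. \tag{2.127}$$

For every $k \in \{0, \ldots, n\}$

$$\begin{aligned}
U_k &= \mathbb{E}_{P_1^n}\left[\sum_{i=1}^{n} \ln \frac{P_1(X_i)}{P_2(X_i)} \,\Bigg|\, \mathcal{F}_k\right] \\
&= \sum_{i=1}^{k} \ln \frac{P_1(X_i)}{P_2(X_i)} + \sum_{i=k+1}^{n} \mathbb{E}_{P_1^n}\left[\ln \frac{P_1(X_i)}{P_2(X_i)}\right] \\
&= \sum_{i=1}^{k} \ln \frac{P_1(X_i)}{P_2(X_i)} + (n-k)\, D(P_1 \| P_2).
\end{aligned} \tag{2.128}$$

In particular

$$U_0 = n D(P_1 \| P_2), \qquad U_n = \sum_{i=1}^{n} \ln \frac{P_1(X_i)}{P_2(X_i)} = L(X_1, \ldots, X_n) \tag{2.129}$$

and, for every $k \in \{1, \ldots, n\}$,

$$U_k - U_{k-1} = \ln \frac{P_1(X_k)}{P_2(X_k)} - D(P_1 \| P_2). \tag{2.130}$$

Let

$$d_1 \triangleq \max_{x \in \mathcal{X}} \left|\ln \frac{P_1(x)}{P_2(x)} - D(P_1 \| P_2)\right| \tag{2.131}$$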



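so $d_1 < \infty$ since by assumption the alphabet set $\mathcal{X}$ is finite, and $P_1(x), P_2(x) > 0$ for every $x \in \mathcal{X}$. From (2.130) and (2.131)

$$|U_k - U_{k-1}| \leq d_1$$

holds a.s. for every $k \in \{1, \ldots, n\}$, and due to the statistical independence of the RVs in the sequence $\{X_i\}$

$$\begin{aligned}
\mathbb{E}_{P_1^n}\bigl[(U_k - U_{k-1})^2 \,\big|\, \mathcal{F}_{k-1}\bigr]
&= \mathbb{E}_{P_1}\left[\left(\ln \frac{P_1(X_k)}{P_2(X_k)} - D(P_1 \| P_2)\right)^2\right] \\
&= \sum_{x \in \mathcal{X}} P_1(x) \left(\ln \frac{P_1(x)}{P_2(x)} - D(P_1 \| P_2)\right)^2 \\
&\triangleq \sigma_1^2.
\end{aligned} \tag{2.132}$$

Let

$$\varepsilon_{1,1} = D(P_1 \| P_2) - \overline{\lambda}, \qquad \varepsilon_{2,1} = D(P_2 \| P_1) + \underline{\lambda} \tag{2.133}$$

$$\varepsilon_{1,2} = D(P_1 \| P_2) - \underline{\lambda}, \qquad \varepsilon_{2,2} = D(P_2 \| P_1) + \overline{\lambda} \tag{2.134}$$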



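The probability of making an erroneous decision on hypothesis $H_2$ or declaring an erasure under the hypothesis $H_1$ is equal to $\alpha_n^{(1)}$, and from Theorem 5

$$\begin{aligned}
\alpha_n^{(1)} &\triangleq P_1^n\bigl(L(X_1, \ldots, X_n) \leq n \overline{\lambda}\bigr) \\
&\overset{(a)}{=} P_1^n\bigl(U_n - U_0 \leq -\varepsilon_{1,1}\, n\bigr) \qquad (2.135) \\
&\overset{(b)}{\leq} \exp\left(-n\, D\!\left(\frac{\delta_{1,1}+\gamma_1}{1+\gamma_1} \,\Big\|\, \frac{\gamma_1}{1+\gamma_1}\right)\right) \qquad (2.136)
\end{aligned}$$

where equality (a) follows from (2.128), (2.129) and (2.133), and inequality (b) follows from Theorem 5 with

$$\gamma_1 \triangleq \frac{\sigma_1^2}{d_1^2}, \qquad \delta_{1,1} \triangleq \frac{\varepsilon_{1,1}}{d_1}. \tag{2.137}$$

Note that if $\varepsilon_{1,1} > d_1$ then it follows from (2.130) and (2.131) that $\alpha_n^{(1)}$ is zero; in this case $\delta_{1,1} > 1$, so the divergence in (2.136) is infinity and the upper bound is also equal to zero. Hence, it is assumed without loss of generality that $\delta_{1,1} \in [0, 1]$.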

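Similarly to (2.127), under hypothesis $H_2$, let us define the martingale sequence $\{U_k, \mathcal{F}_k\}_{k=0}^{n}$ with the same filtration and

$$U_k = \mathbb{E}_{P_2^n}\bigl[L(X_1, \ldots, X_n) \,\big|\, \mathcal{F}_k\bigr], \qquad \forall\, k \in \{0, \ldots, n\}. \tag{2.138}$$

For every $k \in \{0, \ldots, n\}$

$$U_k = \sum_{i=1}^{k} \ln \frac{P_1(X_i)}{P_2(X_i)} - (n-k)\, D(P_2 \| P_1)$$

and in particular

$$U_0 = -n D(P_2 \| P_1), \qquad U_n = L(X_1, \ldots, X_n). \tag{2.139}$$

For every $k \in \{1, \ldots, n\}$,

$$U_k - U_{k-1} = \ln \frac{P_1(X_k)}{P_2(X_k)} + D(P_2 \| P_1). \tag{2.140}$$

Let

$$d_2 \triangleq \max_{x \in \mathcal{X}} \left|\ln \frac{P_2(x)}{P_1(x)} - D(P_2 \| P_1)\right| \tag{2.141}$$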



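then, the jumps of the latter martingale sequence are uniformly bounded by $d_2$ and, similarly to (2.132), for every $k \in \{1, \ldots, n\}$

$$\mathbb{E}_{P_2^n}\bigl[(U_k - U_{k-1})^2 \,\big|\, \mathcal{F}_{k-1}\bigr] = \sum_{x \in \mathcal{X}} P_2(x) \left(\ln \frac{P_2(x)}{P_1(x)} - D(P_2 \| P_1)\right)^2 \triangleq \sigma_2^2. \tag{2.142}$$

Hence, it follows from Theorem 5 that

$$\begin{aligned}
\beta_n^{(1)} &\triangleq P_2^n\bigl(L(X_1, \ldots, X_n) \geq n \underline{\lambda}\bigr) \\
&= P_2^n\bigl(U_n - U_0 \geq \varepsilon_{2,1}\, n\bigr) \qquad (2.143) \\
&\leq \exp\left(-n\, D\!\left(\frac{\delta_{2,1}+\gamma_2}{1+\gamma_2} \,\Big\|\, \frac{\gamma_2}{1+\gamma_2}\right)\right) \qquad (2.144)
\end{aligned}$$

where the equality in (2.143) holds due to (2.139) and (2.133), and (2.144) follows from Theorem 5 with

$$\gamma_2 \triangleq \frac{\sigma_2^2}{d_2^2}, \qquad \delta_{2,1} \triangleq \frac{\varepsilon_{2,1}}{d_2} \tag{2.145}$$

and $d_2$, $\sigma_2$ are introduced, respectively, in (2.141) and (2.142).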

From (2.115), (2.136) and (2.144), the exponent of the probability of either having an error or an erasure is lower bounded by

$$\lim_{n \to \infty} -\frac{\ln P_{e,n}^{(1)}}{n} \geq \min_{i=1,2} D\!\left(\frac{\delta_{i,1}+\gamma_i}{1+\gamma_i} \,\Big\|\, \frac{\gamma_i}{1+\gamma_i}\right). \tag{2.146}$$

Similarly to the above analysis, one gets from (2.116) and (2.134) that the error exponent is lower bounded by

$$\lim_{n \to \infty} -\frac{\ln P_{e,n}^{(2)}}{n} \geq \min_{i=1,2} D\!\left(\frac{\delta_{i,2}+\gamma_i}{1+\gamma_i} \,\Big\|\, \frac{\gamma_i}{1+\gamma_i}\right) \tag{2.147}$$

where

$$\delta_{1,2} \triangleq \frac{\varepsilon_{1,2}}{d_1}, \qquad \delta_{2,2} \triangleq \frac{\varepsilon_{2,2}}{d_2}. \tag{2.148}$$

For the case of a single threshold (i.e., $\underline{\lambda} = \overline{\lambda} \triangleq \lambda$) then (2.146) and (2.147) coincide, and one obtains that the error exponent satisfies

$$\lim_{n \to \infty} -\frac{\ln P_{e,n}}{n} \geq \min_{i=1,2} D\!\left(\frac{\delta_i+\gamma_i}{1+\gamma_i} \,\Big\|\, \frac{\gamma_i}{1+\gamma_i}\right) \tag{2.149}$$

where $\delta_i$ is the common value of $\delta_{i,1}$ and $\delta_{i,2}$ (for $i = 1, 2$). In this special case, the zero threshold is optimal (see, e.g., [84, p. 93]), which then yields that (2.149) is satisfied with

$$\delta_1 = \frac{D(P_1 \| P_2)}{d_1}, \qquad \delta_2 = \frac{D(P_2 \| P_1)}{d_2} \tag{2.150}$$

with $d_1$ and $d_2$ from (2.131) and (2.141), respectively. The right-hand side of (2.149) forms a lower bound on Chernoff information which is the exact error exponent for this special case.
Comparison of the Lower Bounds on the Exponents with those that Follow from Azuma's Inequality

The lower bounds on the error exponent and the exponent of the probability of having either errors or erasures, that were derived in the previous subsection via Theorem 5, are compared in the following to the loosened lower bounds on these exponents that follow from Azuma's inequality.

We first obtain upper bounds on $\alpha_n^{(1)}$, $\alpha_n^{(2)}$, $\beta_n^{(1)}$ and $\beta_n^{(2)}$ via Azuma's inequality, and then use them to derive lower bounds on the exponents of $P_{e,n}^{(1)}$ and $P_{e,n}^{(2)}$.

From (2.130), (2.131), (2.135), (2.137), and Azuma's inequality

$$\alpha_n^{(1)} \leq \exp\left(-\frac{\delta_{1,1}^2\, n}{2}\right) \tag{2.151}$$

and, similarly, from (2.140), (2.141), (2.143), (2.145), and Azuma's inequality

$$\beta_n^{(1)} \leq \exp\left(-\frac{\delta_{2,1}^2\, n}{2}\right). \tag{2.152}$$

From (2.112), (2.114), (2.134), (2.148) and Azuma's inequality

$$\alpha_n^{(2)} \leq \exp\left(-\frac{\delta_{1,2}^2\, n}{2}\right) \tag{2.153}$$

$$\beta_n^{(2)} \leq \exp\left(-\frac{\delta_{2,2}^2\, n}{2}\right). \tag{2.154}$$

Therefore, it follows from (2.115), (2.116) and (2.151)-(2.154) that the resulting lower bounds on the exponents of $P_{e,n}^{(1)}$ and $P_{e,n}^{(2)}$ are

$$\lim_{n \to \infty} -\frac{\ln P_{e,n}^{(j)}}{n} \geq \min_{i=1,2} \frac{\delta_{i,j}^2}{2}, \qquad j = 1, 2 \tag{2.155}$$

as compared to (2.146) and (2.147) which give, for $j = 1, 2$,

$$\lim_{n \to \infty} -\frac{\ln P_{e,n}^{(j)}}{n} \geq \min_{i=1,2} D\!\left(\frac{\delta_{i,j}+\gamma_i}{1+\gamma_i} \,\Big\|\, \frac{\gamma_i}{1+\gamma_i}\right). \tag{2.156}$$

For the specific case of a zero threshold, the lower bound on the error exponent which follows from Azuma's inequality is given by

$$\lim_{n \to \infty} -\frac{\ln P_{e,n}}{n} \geq \min_{i=1,2} \frac{\delta_i^2}{2} \tag{2.157}$$

with the values of $\delta_1$ and $\delta_2$ in (2.150).

The lower bounds on the exponents in (2.155) and (2.156) are compared in the following. Note that the lower bounds in (2.155) are loosened as compared to those in (2.156) since they follow, respectively, from Azuma's inequality and its improvement in Theorem 5.

The divergence in the exponent of (2.156) is equal to

$$\begin{aligned}
D\!\left(\frac{\delta_{i,j}+\gamma_i}{1+\gamma_i} \,\Big\|\, \frac{\gamma_i}{1+\gamma_i}\right)
&= \frac{\delta_{i,j}+\gamma_i}{1+\gamma_i} \ln\left(1 + \frac{\delta_{i,j}}{\gamma_i}\right) + \frac{1-\delta_{i,j}}{1+\gamma_i} \ln(1-\delta_{i,j}) \\
&= \frac{\gamma_i}{1+\gamma_i} \left[\left(1 + \frac{\delta_{i,j}}{\gamma_i}\right) \ln\left(1 + \frac{\delta_{i,j}}{\gamma_i}\right) + \frac{(1-\delta_{i,j}) \ln(1-\delta_{i,j})}{\gamma_i}\right].
\end{aligned} \tag{2.158}$$

Lemma 7.

$$(1+u)\ln(1+u) \geq \begin{cases} u + \dfrac{u^2}{2}, & u \in [-1, 0] \\[2mm] u + \dfrac{u^2}{2} - \dfrac{u^3}{6}, & u \geq 0 \end{cases} \tag{2.159}$$

where at $u = -1$, the left-hand side is defined to be zero (it is the limit of this function when $u \to -1$ from above).

Proof. The proof relies on some elementary calculus. □

Since $\delta_{i,j} \in [0, 1]$, then (2.158) and Lemma 7 imply that

$$D\!\left(\frac{\delta_{i,j}+\gamma_i}{1+\gamma_i} \,\Big\|\, \frac{\gamma_i}{1+\gamma_i}\right) \geq \frac{\delta_{i,j}^2}{2\gamma_i} - \frac{\delta_{i,j}^3}{6\gamma_i^2 (1+\gamma_i)}. \tag{2.160}$$

Hence, by comparing (2.155) with the combination of (2.156) and (2.160), then it follows that (up to a second-order approximation) the lower bounds on the exponents that were derived via Theorem 5 are improved by at least a factor of $(\max_i \gamma_i)^{-1}$ as compared to those that follow from Azuma's inequality.
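The comparison between the two families of lower bounds can be illustrated numerically for the zero-threshold case. The sketch below assumes the quantities $d_i$, $\sigma_i^2$, $\gamma_i = \sigma_i^2 / d_i^2$ and $\delta_i = D / d_i$ as defined for the two hypotheses above; the function names and the ternary pair of distributions are hypothetical and serve only as an illustration.

```python
import math

def kl_bernoulli(p, q):
    """Binary relative entropy D(p||q) in nats, for 0 <= p <= 1 and 0 < q < 1."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total

def delta_gamma(pa, pb):
    """Return (delta, gamma) for the martingale under hypothesis pa, zero threshold."""
    llr = [math.log(a / b) for a, b in zip(pa, pb)]
    div = sum(a * l for a, l in zip(pa, llr))               # D(pa || pb)
    d = max(abs(l - div) for l in llr)                      # bound on the jumps
    var = sum(a * (l - div) ** 2 for a, l in zip(pa, llr))  # conditional variance
    return div / d, var / d ** 2

def lower_bounds(p1, p2):
    """Return (refined, azuma): the minima over both hypotheses of the two exponents."""
    refined, azuma = [], []
    for pa, pb in ((p1, p2), (p2, p1)):
        delta, gamma = delta_gamma(pa, pb)
        refined.append(kl_bernoulli((delta + gamma) / (1.0 + gamma), gamma / (1.0 + gamma)))
        azuma.append(delta ** 2 / 2.0)
    return min(refined), min(azuma)

# Hypothetical distributions on a ternary alphabet.
p1 = [0.5, 0.3, 0.2]
p2 = [0.25, 0.25, 0.5]
refined, azuma = lower_bounds(p1, p2)
print(refined, azuma)
```

For this pair, the refined exponent is roughly three times larger than the one obtained from Azuma's inequality, which is consistent with the small values of $\gamma_i$ in this example.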

Example 11. Consider two probability measures $P_1$ and $P_2$ where

$$P_1(0) = P_2(1) = 0.4, \qquad P_1(1) = P_2(0) = 0.6,$$

and the case of a single threshold of the log-likelihood ratio that is set to zero (i.e., $\lambda = 0$). The exact error exponent in this case is Chernoff information that is equal to

$$C(P_1, P_2) = 2.04 \cdot 10^{-2}.$$

The improved lower bound on the error exponent in (2.149) and (2.150) is equal to $1.77 \cdot 10^{-2}$, whereas the loosened lower bound in (2.157) is equal to $1.39 \cdot 10^{-2}$. In this case $\gamma_1 = \frac{2}{3}$ and $\gamma_2 = \frac{7}{9}$, so the improvement in the lower bound on the error exponent is indeed by a factor of approximately

$$\left(\max_i \gamma_i\right)^{-1} = \frac{9}{7}.$$

Note that, from (2.136), (2.144) and (2.151)-(2.154), these are lower bounds on the error exponents for any finite block length $n$, and not only asymptotically in the limit where $n \to \infty$. The operational meaning of this example is that the improved lower bound on the error exponent assures that a fixed error probability can be obtained based on a sequence of i.i.d. RVs whose length is reduced by 22.2% as compared to the loosened bound which follows from Azuma's inequality.



Comparison of the Exact and Lower Bounds on the Error Exponents, Followed by a Relation 
to Fisher Information 

In the following, we compare the exact and lower bounds on the error exponents. Consider the case where there is a single threshold on the log-likelihood ratio (i.e., referring to the case where the erasure option is not provided) that is set to zero. The exact error exponent in this case is given by the Chernoff information (see (2.126)), and it will be compared to the two lower bounds on the error exponents that were derived in the previous two subsections.

Let $\{P_\theta\}_{\theta \in \Theta}$ denote an indexed family of probability mass functions where $\Theta$ denotes the parameter set. Assume that $P_\theta$ is differentiable in the parameter $\theta$. Then, the Fisher information is defined as

$$J(\theta) \triangleq \mathbb{E}_\theta\left[\left(\frac{\partial}{\partial\theta} \ln P_\theta(x)\right)^2\right] \tag{2.161}$$

where the expectation is w.r.t. the probability mass function $P_\theta$. The divergence and Fisher information are two related information measures, satisfying the equality

$$\lim_{\theta' \to \theta} \frac{D(P_\theta \| P_{\theta'})}{(\theta - \theta')^2} = \frac{J(\theta)}{2} \tag{2.162}$$

(note that if it was a relative entropy to base 2 then the right-hand side of (2.162) would have been divided by $\ln 2$, and be equal to $\frac{J(\theta)}{\ln 4}$ as in [96, Eq. (12.364)]).

Proposition 2. Under the above assumptions,

• The Chernoff information and Fisher information are related information measures that satisfy the equality

$$\lim_{\theta' \to \theta} \frac{C(P_\theta, P_{\theta'})}{(\theta - \theta')^2} = \frac{J(\theta)}{8}. \tag{2.163}$$

• Let

$$E_L(P_\theta, P_{\theta'}) \triangleq \min_{i=1,2} D\!\left(\frac{\delta_i+\gamma_i}{1+\gamma_i} \,\Big\|\, \frac{\gamma_i}{1+\gamma_i}\right) \tag{2.164}$$

be the lower bound on the error exponent in (2.149) which corresponds to $P_1 \triangleq P_\theta$ and $P_2 \triangleq P_{\theta'}$, then also

$$\lim_{\theta' \to \theta} \frac{E_L(P_\theta, P_{\theta'})}{(\theta - \theta')^2} = \frac{J(\theta)}{8}. \tag{2.165}$$

• Let

$$\widetilde{E}_L(P_\theta, P_{\theta'}) \triangleq \min_{i=1,2} \frac{\delta_i^2}{2} \tag{2.166}$$

be the loosened lower bound on the error exponent in (2.157) which refers to $P_1 \triangleq P_\theta$ and $P_2 \triangleq P_{\theta'}$. Then,

$$\lim_{\theta' \to \theta} \frac{\widetilde{E}_L(P_\theta, P_{\theta'})}{(\theta - \theta')^2} = \frac{a(\theta)\, J(\theta)}{8} \tag{2.167}$$

for some deterministic function $a$ bounded in $[0, 1]$, and there exists an indexed family of probability mass functions for which $a(\theta)$ can be made arbitrarily close to zero for any fixed value of $\theta \in \Theta$.

Proof. See Appendix 2.C. □

Proposition 2 shows that, in the considered setting, the refined lower bound on the error exponent provides the correct behavior of the error exponent for a binary hypothesis testing when the relative entropy between the pair of probability mass functions that characterize the two hypotheses tends to zero. This stays in contrast to the loosened error exponent, which follows from Azuma's inequality, whose scaling may differ significantly from the correct exponent (for a concrete example, see the last part of the proof in Appendix 2.C).

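Example 12. Consider the indexed family of probability mass functions defined over the binary alphabet $\mathcal{X} = \{0, 1\}$:

$$P_\theta(0) = 1 - \theta, \qquad P_\theta(1) = \theta, \qquad \forall\, \theta \in (0, 1).$$

From (2.161), the Fisher information is equal to

$$J(\theta) = \frac{1}{\theta} + \frac{1}{1-\theta}$$

and, at the point $\theta = 0.5$, $J(\theta) = 4$. Let $\theta_1 = 0.51$ and $\theta_2 = 0.49$, so from (2.163) and (2.165)

$$C(P_{\theta_1}, P_{\theta_2}),\; E_L(P_{\theta_1}, P_{\theta_2}) \approx \frac{J(\theta)\,(\theta_1 - \theta_2)^2}{8} = 2.00 \cdot 10^{-4}.$$

Indeed, the exact values of $C(P_{\theta_1}, P_{\theta_2})$ and $E_L(P_{\theta_1}, P_{\theta_2})$ are $2.000 \cdot 10^{-4}$ and $1.997 \cdot 10^{-4}$, respectively.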
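The Fisher-information approximation of the Chernoff information in this example can be reproduced with a short computation. This is a minimal sketch: the function names are illustrative, and the minimization over $t \in [0, 1]$ is done on a uniform grid, which is an implementation choice made here for simplicity.

```python
import math

def fisher_bernoulli(theta):
    """Fisher information J(theta) = 1/theta + 1/(1 - theta) of the binary family."""
    return 1.0 / theta + 1.0 / (1.0 - theta)

def chernoff_bernoulli(theta_a, theta_b, grid=10001):
    """Chernoff information of two binary distributions, minimizing H(t) on a grid."""
    best = 0.0  # H(0) = H(1) = 0, so the minimum is at most zero
    for k in range(grid):
        t = k / (grid - 1)
        h = math.log(
            (1.0 - theta_a) ** (1.0 - t) * (1.0 - theta_b) ** t
            + theta_a ** (1.0 - t) * theta_b ** t
        )
        best = min(best, h)
    return -best

theta1, theta2 = 0.51, 0.49
approx = fisher_bernoulli(0.5) * (theta1 - theta2) ** 2 / 8.0
exact = chernoff_bernoulli(theta1, theta2)
print(approx, exact)
```

Both printed values agree with $2.00 \cdot 10^{-4}$ to the precision quoted in the example.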
2.6.2 Minimum distance of binary linear block codes

Consider the ensemble of binary linear block codes of length $n$ and rate $R$. The average value of the normalized minimum distance is equal to

$$\frac{\mathbb{E}[d_{\min}(\mathcal{C})]}{n} = h_2^{-1}(1 - R)$$

where $h_2^{-1}$ designates the inverse of the binary entropy function to the base 2, and the expectation is with respect to the ensemble where the codes are chosen uniformly at random (see [100]).

Let $H$ designate an $n(1-R) \times n$ parity-check matrix of a linear block code $\mathcal{C}$ from this ensemble. The minimum distance of the code is equal to the minimal number of columns in $H$ that are linearly dependent. Note that the minimum distance is a property of the code, and it does not depend on the choice of the particular parity-check matrix which represents the code.

Let us construct a martingale sequence $X_0, \ldots, X_n$ where $X_i$ (for $i = 0, 1, \ldots, n$) is a RV that denotes the minimal number of linearly dependent columns of a parity-check matrix that is chosen uniformly at random from the ensemble, given that we already revealed its first $i$ columns. Based on Remarks 2 and 3, this sequence forms indeed a martingale sequence where the associated filtration of the $\sigma$-algebras $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n$ is defined so that $\mathcal{F}_i$ (for $i = 0, 1, \ldots, n$) is the $\sigma$-algebra that is generated by all the subsets of $n(1-R) \times n$ binary parity-check matrices whose first $i$ columns are fixed. This martingale sequence satisfies $|X_i - X_{i-1}| \leq 1$ for $i = 1, \ldots, n$ (since if we reveal a new column of $H$, then the minimal number of linearly dependent columns can change by at most 1). Note that the RV $X_0$ is the expected minimum Hamming distance of the ensemble, and $X_n$ is the minimum distance of a particular code from the ensemble (since once we revealed all the $n$ columns of $H$, then the code is known exactly). Hence, by Azuma's inequality

$$\mathbb{P}\bigl(|d_{\min}(\mathcal{C}) - \mathbb{E}[d_{\min}(\mathcal{C})]| \geq \alpha\sqrt{n}\bigr) \leq 2\exp\left(-\frac{\alpha^2}{2}\right), \qquad \forall\, \alpha > 0.$$

This leads to the following theorem:

Theorem 10. [The minimum distance of binary linear block codes] Let $\mathcal{C}$ be chosen uniformly at random from the ensemble of binary linear block codes of length $n$ and rate $R$. Then for every $\alpha > 0$, with probability at least $1 - 2\exp\left(-\frac{\alpha^2}{2}\right)$, the minimum distance of $\mathcal{C}$ is in the interval

$$\bigl[n\, h_2^{-1}(1-R) - \alpha\sqrt{n},\; n\, h_2^{-1}(1-R) + \alpha\sqrt{n}\bigr]$$

and it therefore concentrates around its expected value.
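The interval of Theorem 10 is simple to evaluate once the inverse of the binary entropy function is available. The following is a minimal sketch; the bisection-based inverse and the function names are implementation choices made here, and the parameters ($n = 10000$, $R = 0.5$, $\alpha = 3$) are hypothetical.

```python
import math

def h2(x):
    """Binary entropy function to the base 2."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def h2_inverse(y, iters=100):
    """Inverse of h2 on [0, 1/2], found by bisection (h2 is increasing there)."""
    lo, hi = 0.0, 0.5
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if h2(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def min_distance_interval(n, rate, alpha):
    """Return (low, high, probability) for the concentration interval of Theorem 10."""
    center = n * h2_inverse(1.0 - rate)
    radius = alpha * math.sqrt(n)
    probability = 1.0 - 2.0 * math.exp(-alpha ** 2 / 2.0)
    return center - radius, center + radius, probability

low, high, prob = min_distance_interval(n=10000, rate=0.5, alpha=3.0)
print(low, high, prob)
```

The width of the interval grows like $\sqrt{n}$ while its center grows linearly in $n$, which is the sense in which the minimum distance concentrates.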

Note, however, that some well-known capacity-approaching families of binary linear block codes possess a minimum Hamming distance which grows sub-linearly with the block length $n$. For example, the class of parallel concatenated convolutional (turbo) codes was proved to have a minimum distance which grows at most like the logarithm of the interleaver length [101].

2.6.3 Concentration of the cardinality of the fundamental system of cycles for LDPC 
code ensembles 

Low-density parity-check (LDPC) codes are linear block codes that are represented by sparse parity-check matrices [102]. A sparse parity-check matrix enables one to represent the corresponding linear block code by a sparse bipartite graph, and to use this graphical representation for implementing low-complexity iterative message-passing decoding. The low-complexity decoding algorithms used for LDPC codes and some of their variants are remarkable in that they achieve rates close to the Shannon capacity limit for properly designed code ensembles (see, e.g., [13]). As a result of their remarkable performance under practical decoding algorithms, these coding techniques have revolutionized the field of channel coding and they have been incorporated in various digital communication standards during the last decade.

In the following, we consider ensembles of binary LDPC codes. The codes are represented by bipartite 
graphs where the variable nodes are located on the left side of the graph, and the parity-check nodes are 
on the right. The parity-check equations that define the linear code are represented by edges connecting 
each check node with the variable nodes that are involved in the corresponding parity-check equation. 
The bipartite graphs representing these codes are sparse in the sense that the number of edges in the 
graph scales linearly with the block length $n$ of the code. Following standard notation, let $\lambda_i$ and $\rho_i$ denote the fraction of edges attached, respectively, to variable and parity-check nodes of degree $i$. The LDPC code ensemble is denoted by LDPC$(n, \lambda, \rho)$ where $n$ is the block length of the codes, and the pair $\lambda(x) \triangleq \sum_i \lambda_i x^{i-1}$ and $\rho(x) \triangleq \sum_i \rho_i x^{i-1}$ represents, respectively, the left and right degree distributions of the ensemble from the edge perspective. For a short summary of preliminary material on binary LDPC code ensembles see, e.g., [103, Section II-A].

It is well known that linear block codes which can be represented by cycle-free bipartite (Tanner) graphs have poor performance even under ML decoding [104]. The bipartite graphs of capacity-approaching LDPC codes should therefore have cycles. For analyzing this issue, we focused on the notion of "the cardinality of the fundamental system of cycles of bipartite graphs". For the required preliminary material, the reader is referred to [103, Section II-E]. In [103], we address the following question:

Question: Consider an LDPC ensemble whose transmission takes place over a memoryless binary-input output-symmetric channel, and refer to the bipartite graphs which represent codes from this ensemble where every code is chosen uniformly at random from the ensemble. How does the average cardinality of the fundamental system of cycles of these bipartite graphs scale as a function of the achievable gap to capacity?

In light of this question, an information-theoretic lower bound on the average cardinality of the fundamental system of cycles was derived in [103, Corollary 1]. This bound was expressed in terms of the achievable gap to capacity (even under ML decoding) when the communication takes place over a memoryless binary-input output-symmetric channel. More explicitly, it was shown that if $\varepsilon$ designates the gap in rate to capacity, then the number of fundamental cycles should grow at least like $\log\frac{1}{\varepsilon}$. Hence, this lower bound remains unbounded as the gap to capacity tends to zero. Consistently with the study in [104] on cycle-free codes, the lower bound on the cardinality of the fundamental system of cycles in [103, Corollary 1] shows quantitatively the necessity of cycles in bipartite graphs which represent good LDPC code ensembles. As a continuation to this work, we present in the following a large-deviations analysis with respect to the cardinality of the fundamental system of cycles for LDPC code ensembles.

Let the triple $(n, \lambda, \rho)$ represent an LDPC code ensemble, and let $\mathcal{G}$ be a bipartite graph that corresponds to a code from this ensemble. Then, the cardinality of the fundamental system of cycles of $\mathcal{G}$, denoted by $\beta(\mathcal{G})$, is equal to

$$\beta(\mathcal{G}) = |E(\mathcal{G})| - |V(\mathcal{G})| + c(\mathcal{G})$$

where $E(\mathcal{G})$ and $V(\mathcal{G})$ denote the sets of edges and vertices of $\mathcal{G}$, respectively, $c(\mathcal{G})$ denotes its number of connected components, and $|A|$ denotes the number of elements of a (finite) set $A$. Note that for such a bipartite graph $\mathcal{G}$, there are $n$ variable nodes and $m = n(1 - R_d)$ parity-check nodes, so there are in total $|V(\mathcal{G})| = n(2 - R_d)$ nodes. Let $a_R$ designate the average right degree (i.e., the average degree of the parity-check nodes), then the number of edges in $\mathcal{G}$ is given by $|E(\mathcal{G})| = m\, a_R$. Therefore, for a code from the $(n, \lambda, \rho)$ LDPC code ensemble, the cardinality of the fundamental system of cycles satisfies the equality

$$\beta(\mathcal{G}) = n\bigl[(1 - R_d)\, a_R - (2 - R_d)\bigr] + c(\mathcal{G}) \tag{2.168}$$

where

$$R_d = 1 - \frac{\int_0^1 \rho(x)\, dx}{\int_0^1 \lambda(x)\, dx}, \qquad a_R = \frac{1}{\int_0^1 \rho(x)\, dx}$$

denote, respectively, the design rate and average right degree of the ensemble.
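The identity $\beta(\mathcal{G}) = |E(\mathcal{G})| - |V(\mathcal{G})| + c(\mathcal{G})$ can be evaluated directly for a given bipartite graph by counting connected components. The sketch below uses a union-find structure for that purpose; this is an implementation choice made here, and the function name and the two toy graphs are hypothetical.

```python
def cycle_rank(num_variable_nodes, num_check_nodes, edges):
    """Return |E| - |V| + c for a bipartite graph.

    `edges` is a list of (variable index, check index) pairs; the number of
    connected components c is obtained with a union-find structure.
    """
    num_vertices = num_variable_nodes + num_check_nodes
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    components = num_vertices
    for var, chk in edges:
        a, b = find(var), find(num_variable_nodes + chk)
        if a != b:
            parent[a] = b
            components -= 1
    return len(edges) - num_vertices + components

# Hypothetical toy graphs: a single 4-cycle, and a cycle-free graph (a tree).
four_cycle = [(0, 0), (0, 1), (1, 0), (1, 1)]
tree = [(0, 0), (1, 0)]
print(cycle_rank(2, 2, four_cycle))
print(cycle_rank(2, 1, tree))
```

A cycle-free graph has cycle rank zero, and every additional edge that closes a cycle inside an existing component increases the count by one.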
Let

$$E \triangleq |E(\mathcal{G})| = n(1 - R_d)\, a_R \tag{2.169}$$

denote the number of edges of an arbitrary bipartite graph $\mathcal{G}$ from the ensemble (where we refer interchangeably to codes and to the bipartite graphs that represent these codes from the considered ensemble). Let us arbitrarily assign numbers $1, \ldots, E$ to the $E$ edges of $\mathcal{G}$. Based on Remarks 2 and 3, let us construct a martingale sequence $X_0, \ldots, X_E$ where $X_i$ (for $i = 0, 1, \ldots, E$) is a RV that denotes the conditional expected number of components of a bipartite graph $\mathcal{G}$, chosen uniformly at random from the ensemble, given that the first $i$ edges of the graph $\mathcal{G}$ are revealed. Note that the corresponding filtration $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_E$ in this case is defined so that $\mathcal{F}_i$ is the $\sigma$-algebra that is generated by all the sets of bipartite graphs from the considered ensemble whose first $i$ edges are fixed. For this martingale sequence

$$X_0 = \mathbb{E}_{\text{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})], \qquad X_E = \beta(\mathcal{G})$$

and (a.s.) $|X_k - X_{k-1}| \leq 1$ for $k = 1, \ldots, E$ (since by revealing a new edge of $\mathcal{G}$, the number of components in this graph can change by at most 1). By Corollary 1, it follows that for every $\alpha \geq 0$

$$\begin{aligned}
&\mathbb{P}\bigl(|c(\mathcal{G}) - \mathbb{E}_{\text{LDPC}(n,\lambda,\rho)}[c(\mathcal{G})]| \geq \alpha E\bigr) \leq 2 e^{-f(\alpha) E} \\
\Rightarrow\; &\mathbb{P}\bigl(|\beta(\mathcal{G}) - \mathbb{E}_{\text{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})]| \geq \alpha E\bigr) \leq 2 e^{-f(\alpha) E}
\end{aligned} \tag{2.170}$$

where the last transition follows from (2.168), and the function $f$ was defined in (2.50). Hence, for $\alpha > 1$, this probability is zero (since $f(\alpha) = +\infty$ for $\alpha > 1$). Note that, from (2.168), $\mathbb{E}_{\text{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})]$ scales linearly with $n$. The combination of Eqs. (2.50), (2.169), (2.170) gives the following statement:

Theorem 11. [Concentration result for the cardinality of the fundamental system of cycles] Let LDPC$(n, \lambda, \rho)$ be the LDPC code ensemble that is characterized by a block length $n$, and a pair of degree distributions (from the edge perspective) of $\lambda$ and $\rho$. Let $\mathcal{G}$ be a bipartite graph chosen uniformly at random from this ensemble. Then, for every $\alpha \geq 0$, the cardinality of the fundamental system of cycles of $\mathcal{G}$, denoted by $\beta(\mathcal{G})$, satisfies the following inequality:

$$\mathbb{P}\bigl(|\beta(\mathcal{G}) - \mathbb{E}_{\text{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})]| \geq \alpha n\bigr) \leq 2 \cdot 2^{-\left[1 - h_2\left(\frac{1-\eta}{2}\right)\right] n}$$

where $h_2$ designates the binary entropy function to the base 2, $\eta \triangleq \frac{\alpha}{(1 - R_d)\, a_R}$, and $R_d$ and $a_R$ designate, respectively, the design rate and average right degree of the ensemble. Consequently, if $\eta > 1$, this probability is zero.

Remark 12. The loosened version of Theorem 11, which follows from Azuma's inequality, gets the form

$$\mathbb{P}\bigl(|\beta(\mathcal{G}) - \mathbb{E}_{\text{LDPC}(n,\lambda,\rho)}[\beta(\mathcal{G})]| \geq \alpha n\bigr) \leq 2 e^{-\frac{\eta^2 n}{2}}$$

for every $\alpha \geq 0$, and $\eta$ as defined in Theorem 11. Note, however, that the exponential decay of the two bounds is similar for values of $\alpha$ close to zero (see the exponents in Azuma's inequality and Corollary 1 in Figure 2.1).
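The closeness of the two exponents for small deviations can be seen numerically. The sketch below is written under the assumption that the exponent of Theorem 11, expressed in nats per unit of $n$, is $\ln 2 \cdot \bigl[1 - h_2\bigl(\frac{1-\eta}{2}\bigr)\bigr]$ with $\eta = \frac{\alpha}{(1 - R_d)\, a_R}$, and that the exponent of the loosened bound is $\frac{\eta^2}{2}$; the function names and the (3,6)-regular example ($R_d = 0.5$, $a_R = 6$) are illustrative choices.

```python
import math

def h2(x):
    """Binary entropy function to the base 2."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def refined_exponent(eta):
    """Assumed exponent of Theorem 11 in nats: ln(2) * [1 - h2((1 - eta)/2)]."""
    if eta > 1.0:
        return math.inf
    return math.log(2.0) * (1.0 - h2((1.0 - eta) / 2.0))

def azuma_exponent(eta):
    """Exponent of the loosened bound that follows from Azuma's inequality."""
    return eta ** 2 / 2.0

def eta_from_ensemble(alpha, design_rate, avg_right_degree):
    """eta = alpha / ((1 - R_d) * a_R)."""
    return alpha / ((1.0 - design_rate) * avg_right_degree)

# Hypothetical example: (3,6)-regular ensemble, so R_d = 0.5 and a_R = 6.
eta = eta_from_ensemble(alpha=0.3, design_rate=0.5, avg_right_degree=6.0)
print(eta, refined_exponent(eta), azuma_exponent(eta))
for e in (0.05, 0.25, 0.5, 0.75, 1.0):
    print(e, refined_exponent(e), azuma_exponent(e))
```

The two exponents nearly coincide for small $\eta$ and separate as $\eta$ approaches 1, where the refined exponent equals $\ln 2$ while the loosened one equals $\frac{1}{2}$.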

Remark 13. For various capacity-achieving sequences of LDPC code ensembles on the binary erasure channel, the average right degree scales like $\log\frac{1}{\varepsilon}$ where $\varepsilon$ denotes the fractional gap to capacity under belief-propagation decoding (i.e., $R_d = (1 - \varepsilon) C$) [35]. Therefore, for small values of $\alpha$, the exponential decay rate in the inequality of Theorem 11 scales like $\bigl(\log\frac{1}{\varepsilon}\bigr)^{-2}$. This large-deviations result complements the result in [103, Corollary 1] which provides a lower bound on the average cardinality of the fundamental system of cycles that scales like $\log\frac{1}{\varepsilon}$.

Remark 14. Consider small deviations from the expected value that scale like $\sqrt{n}$. Note that Corollary 1
is a special case of Theorem 5 when $\gamma = 1$ (i.e., when only an upper bound on the jumps of the martingale
sequence is available, but there is no non-trivial upper bound on the conditional variance). Hence, it
follows from Proposition 1 that Corollary 1 does not provide in this case any improvement in the exponent
of the concentration inequality (as compared to Azuma's inequality) when small deviations are considered.

Figure 2.2: Message flow neighborhood of depth 1. In this figure, $(I, W, d_v = L, d_c = R) = (1, 1, 2, 3)$.

2.6.4 Concentration Theorems for LDPC Code Ensembles over ISI channels 

Concentration analysis on the number of erroneous variable-to-check messages for random ensembles of
LDPC codes was introduced in [36] and [105] for memoryless channels. It was shown that the performance
of an individual code from the ensemble concentrates around the expected (average) value over this
ensemble when the block length of the code grows, and that this average behavior converges
to the behavior of the cycle-free case. These results were later generalized in [106] to the case of
intersymbol-interference (ISI) channels. The proofs of [106, Theorems 1 and 2], which refer to regular
LDPC code ensembles, are revisited in the following in order to derive an explicit expression for the
exponential rate of the concentration inequality. It is then shown that particularizing the expression for
memoryless channels provides a tightened concentration inequality as compared to [36] and [105]. The
presentation in this subsection is based on a recent work by Ronen Eshel [107].

The ISI Channel and its message-passing decoding 

In the following, we briefly describe the ISI channel and the graph used for its message-passing decoding.
For a detailed description, the reader is referred to [106]. Consider a binary discrete-time ISI channel
with a finite memory length, denoted by $I$. The channel output $Y_j$ at time instant $j$ is given by

$$Y_j = \sum_{i=0}^{I} h_i X_{j-i} + N_j, \qquad \forall\, j \in \mathbb{Z}$$

where $\{X_j\}$ is the binary input sequence ($X_j \in \{+1, -1\}$), $\{h_i\}_{i=0}^{I}$ refers to the impulse response of the ISI
channel, and $\{N_j\}$ is a sequence of i.i.d. Gaussian random variables with zero mean and variance $\sigma^2$
(i.e., $N_j \sim \mathcal{N}(0, \sigma^2)$). It is
assumed that an information block of length k is encoded by using a regular $(n, d_v, d_c)$ LDPC code, and
the resulting n coded bits are converted to the channel input sequence before its transmission over the
channel. For decoding, we consider the windowed version of the sum-product algorithm when applied
to ISI channels (for specific details about this decoding algorithm, the reader is referred to [106] and
[108]; in general, it is an iterative message-passing decoding algorithm). The variable-to-check and
check-to-variable messages are computed as in the sum-product algorithm for the memoryless case, with the
difference that a variable node's message from the channel is not only a function of the channel output
that corresponds to the considered symbol but also a function of 2W neighboring channel outputs and
2W neighboring variable nodes, as illustrated in Fig. 2.2.
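The channel model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `isi_channel` and the example impulse response are hypothetical, and inputs before time 0 are treated as absent, whereas the text considers a two-sided sequence.

```python
import random

def isi_channel(x, h, sigma, rng=None):
    """Return outputs Y_j = sum_i h[i] * x[j - i] + N_j for j = 0, ..., len(x) - 1.

    Inputs before time 0 are taken to be absent (an assumption of this sketch;
    the text considers a two-sided sequence indexed by the integers).
    """
    rng = rng or random.Random()
    y = []
    for j in range(len(x)):
        # Noise-free ISI term: the current input plus the previous len(h) - 1 inputs.
        clean = sum(h[i] * x[j - i] for i in range(len(h)) if j - i >= 0)
        # Additive i.i.d. Gaussian noise with zero mean and standard deviation sigma.
        y.append(clean + rng.gauss(0.0, sigma))
    return y

if __name__ == "__main__":
    x = [+1, -1, -1, +1, +1]   # binary inputs in {+1, -1}
    h = [1.0, 0.5]             # hypothetical impulse response, memory length I = 1
    print(isi_channel(x, h, sigma=0.0))            # noise-free outputs
    print(isi_channel(x, h, sigma=0.5, rng=random.Random(0)))
```

Setting `sigma=0.0` isolates the intersymbol interference, which makes it easy to see how each output depends on the current input and on the $I$ preceding inputs.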
Concentration

It is proved in this sub-section that for a large n, a neighborhood of depth $\ell$ of a variable-to-check node
message is tree-like with high probability. Using the Azuma-Hoeffding inequality and the latter result,
it is shown that for most graphs and channel realizations, if s is the transmitted codeword, then the
probability of a variable-to-check message being erroneous after $\ell$ rounds of message-passing decoding is
highly concentrated around its expected value. This expected value is shown to converge to the value of
$p^{(\ell)}(s)$ which corresponds to the cycle-free case.

In the following theorems, we consider an ISI channel and a windowed message-passing decoding
algorithm, when the code graph is chosen uniformly at random from the ensemble of the graphs with
variable and check node degrees $d_v$ and $d_c$, respectively. Let $\mathcal{N}_{\vec{e}}^{(\ell)}$ denote the neighborhood of depth $\ell$ of
an edge $\vec{e} = (v, c)$ between a variable node and a check node. Let $N_c^{(\ell)}$, $N_v^{(\ell)}$ and $N_e^{(\ell)}$ denote, respectively, the
total number of check nodes, variable nodes and code-related edges in this neighborhood. Similarly, let
$N_Y^{(\ell)}$ denote the number of variable-to-check node messages in the directed neighborhood of depth $\ell$ of a
received symbol of the channel.

Theorem 12. [Probability of a neighborhood of depth $\ell$ of a variable-to-check node message
to be tree-like for channels with ISI] Let $P_{\bar{t}}^{(\ell)} \equiv \Pr\bigl\{\mathcal{N}_{\vec{e}}^{(\ell)} \text{ not a tree}\bigr\}$ denote the probability that
the sub-graph $\mathcal{N}_{\vec{e}}^{(\ell)}$ is not a tree (i.e., it contains cycles). Then, there exists a positive constant
$\gamma \triangleq \gamma(d_v, d_c, \ell)$ that does not depend on the block length n such that $P_{\bar{t}}^{(\ell)} \le \frac{\gamma}{n}$. More explicitly, one can
choose

$$\gamma(d_v, d_c, \ell) \triangleq \bigl(N_v^{(\ell)}\bigr)^2 + \Bigl(\frac{d_c}{d_v}\, N_c^{(\ell)}\Bigr)^2.$$

Proof. This proof forms a straightforward generalization of the proof in [36] (for binary-input
output-symmetric memoryless channels) to binary-input ISI channels. A detailed proof is available in [107]. □

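The following concentration inequalities follow from Theorem 12 and the Azuma-Hoeffding inequality:

Theorem 13. [Concentration of the number of erroneous variable-to-check messages for
channels with ISI] Let s be the transmitted codeword. Let $Z^{(\ell)}(s)$ be the number of erroneous
variable-to-check messages after $\ell$ rounds of the windowed message-passing decoding algorithm when the code
graph is chosen uniformly at random from the ensemble of the graphs with variable and check node
degrees $d_v$ and $d_c$, respectively. Let $p^{(\ell)}(s)$ be the expected fraction of incorrect messages passed through
an edge with a tree-like directed neighborhood of depth $\ell$. Then, there exist some positive constants $\beta$
and $\gamma$ that do not depend on the block length n such that

[Concentration around expectation] For any $\varepsilon > 0$,

$$\mathbb{P}\left(\left|\frac{Z^{(\ell)}(s)}{n d_v} - \frac{\mathbb{E}[Z^{(\ell)}(s)]}{n d_v}\right| > \frac{\varepsilon}{2}\right) \le 2 e^{-\beta \varepsilon^2 n}. \qquad (2.171)$$

[Convergence of expectation to the cycle-free case] For any $\varepsilon > 0$ and $n > \frac{2\gamma}{\varepsilon}$, we have a.s.

$$\left|\frac{\mathbb{E}[Z^{(\ell)}(s)]}{n d_v} - p^{(\ell)}(s)\right| \le \frac{\varepsilon}{2}. \qquad (2.172)$$

[Concentration around the cycle-free case] For any $\varepsilon > 0$ and $n > \frac{2\gamma}{\varepsilon}$,

$$\mathbb{P}\left(\left|\frac{Z^{(\ell)}(s)}{n d_v} - p^{(\ell)}(s)\right| > \varepsilon\right) \le 2 e^{-\beta \varepsilon^2 n}. \qquad (2.173)$$

More explicitly, it holds for

$$\beta \triangleq \beta(d_v, d_c, \ell) = \frac{d_v^2}{8\Bigl(4 d_v \bigl(N_e^{(\ell)}\bigr)^2 + \bigl(N_Y^{(\ell)}\bigr)^2\Bigr)}$$

and

$$\gamma \triangleq \gamma(d_v, d_c, \ell) = \bigl(N_v^{(\ell)}\bigr)^2 + \Bigl(\frac{d_c}{d_v}\, N_c^{(\ell)}\Bigr)^2.$$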



Proof. From the triangle inequality, we have

$$\mathbb{P}\left(\left|\frac{Z^{(\ell)}(s)}{n d_v} - p^{(\ell)}(s)\right| > \varepsilon\right) \le \mathbb{P}\left(\left|\frac{Z^{(\ell)}(s)}{n d_v} - \frac{\mathbb{E}[Z^{(\ell)}(s)]}{n d_v}\right| > \frac{\varepsilon}{2}\right) + \mathbb{P}\left(\left|\frac{\mathbb{E}[Z^{(\ell)}(s)]}{n d_v} - p^{(\ell)}(s)\right| > \frac{\varepsilon}{2}\right). \qquad (2.174)$$

If inequality (2.172) holds a.s., then $\mathbb{P}\Bigl(\Bigl|\frac{\mathbb{E}[Z^{(\ell)}(s)]}{n d_v} - p^{(\ell)}(s)\Bigr| > \frac{\varepsilon}{2}\Bigr) = 0$; therefore, using (2.174), we deduce
that (2.173) follows from (2.171) and (2.172) for any $\varepsilon > 0$ and $n > \frac{2\gamma}{\varepsilon}$. We start by proving (2.171). For
an arbitrary sequence s, the random variable $Z^{(\ell)}(s)$ denotes the number of incorrect variable-to-check
node messages among all $n d_v$ variable-to-check node messages passed in the $\ell$-th iteration for a particular
graph $\mathcal{G}$ and decoder input $\underline{Y}$. Let us form a martingale by first exposing the $n d_v$ edges of the graph
one by one, and then exposing the n received symbols $Y_i$ one by one. Let $a$ denote the sequence of
the $n d_v$ variable-to-check node edges of the graph, followed by the sequence of the n received symbols
at the channel output. For $i = 0, \ldots, n(d_v + 1)$, let the RV $\tilde{Z}_i \triangleq \mathbb{E}[Z^{(\ell)}(s) \mid a_1, \ldots, a_i]$ be defined as the
conditional expectation of $Z^{(\ell)}(s)$ given the first i elements of the sequence $a$. Note that it forms a
martingale sequence (see Remark 2) where $\tilde{Z}_0 = \mathbb{E}[Z^{(\ell)}(s)]$ and $\tilde{Z}_{n(d_v+1)} = Z^{(\ell)}(s)$. Hence, getting an
upper bound on the sequence of differences $|\tilde{Z}_{i+1} - \tilde{Z}_i|$ enables us to apply the Azuma-Hoeffding inequality
to prove concentration around the expected value $\tilde{Z}_0$. To this end, let us consider the effect of exposing
an edge of the graph. Consider two graphs $\mathcal{G}$ and $\tilde{\mathcal{G}}$ whose edges are identical except for an exchange of
an endpoint of two edges. A variable-to-check message is affected by this change if at least one of these
edges is included in its directed neighborhood of depth $\ell$.

Consider a neighborhood of depth $\ell$ of a variable-to-check node message. Since at each level the
graph expands by a factor $\alpha \triangleq (d_v - 1 + 2W d_v)(d_c - 1)$, there are, in total,

$$N_e^{(\ell)} = 1 + d_c (d_v - 1 + 2W d_v) \sum_{i=0}^{\ell-1} \alpha^i$$

edges related to the code structure (variable-to-check node edges or vice versa) in the neighborhood $\mathcal{N}_{\vec{e}}^{(\ell)}$.
By symmetry, the two edges can affect at most $2 N_e^{(\ell)}$ neighborhoods (alternatively, we could directly
sum the number of variable-to-check node edges in a neighborhood of a variable-to-check node edge and
in a neighborhood of a check-to-variable node edge). The change in the number of incorrect
variable-to-check node messages is bounded by the extreme case where each change in the neighborhood of a
message introduces an error. In a similar manner, when we reveal a received output symbol, the
variable-to-check node messages whose directed neighborhoods include that channel input can be affected. We
consider a neighborhood of depth $\ell$ of a received output symbol. By counting, it can be shown that this
neighborhood includes

$$N_Y^{(\ell)} = (2W + 1)\, d_v \sum_{i=0}^{\ell-1} \alpha^i$$

variable-to-check node edges. Therefore, a change of a received output symbol can affect up to $N_Y^{(\ell)}$
variable-to-check node messages. We conclude that $|\tilde{Z}_{i+1} - \tilde{Z}_i| \le 2 N_e^{(\ell)}$ for the first $n d_v$ exposures, and
$|\tilde{Z}_{i+1} - \tilde{Z}_i| \le N_Y^{(\ell)}$ for the last n exposures. By applying the Azuma-Hoeffding inequality, it follows that

$$\mathbb{P}\left(\left|\frac{Z^{(\ell)}(s)}{n d_v} - \frac{\mathbb{E}[Z^{(\ell)}(s)]}{n d_v}\right| > \frac{\varepsilon}{2}\right) \le 2 \exp\left(-\frac{(n d_v \varepsilon / 2)^2}{2\Bigl(n d_v \bigl(2 N_e^{(\ell)}\bigr)^2 + n \bigl(N_Y^{(\ell)}\bigr)^2\Bigr)}\right)$$

and a comparison of this concentration inequality to (2.171) gives that

$$\frac{1}{\beta} = \frac{8\Bigl(4 d_v \bigl(N_e^{(\ell)}\bigr)^2 + \bigl(N_Y^{(\ell)}\bigr)^2\Bigr)}{d_v^2}. \qquad (2.175)$$



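Next, proving inequality (2.172) relies on concepts from [36] and [106]. Let $\mathbb{E}[Z_i^{(\ell)}(s)]$ ($i \in \{1, \ldots, n d_v\}$)
be the expected number of incorrect messages passed along edge $\vec{e}_i$ after $\ell$ rounds, where the average is
w.r.t. all realizations of graphs and all output symbols from the channel. Then, by the symmetry in the
graph construction and by the linearity of the expectation, it follows that

$$\mathbb{E}[Z^{(\ell)}(s)] = \sum_{i \in [n d_v]} \mathbb{E}[Z_i^{(\ell)}(s)] = n d_v\, \mathbb{E}[Z_1^{(\ell)}(s)]. \qquad (2.176)$$

From the law of total expectation,

$$\mathbb{E}[Z_1^{(\ell)}(s)] = \mathbb{E}\bigl[Z_1^{(\ell)}(s) \mid \mathcal{N}_{\vec{e}}^{(\ell)} \text{ is a tree}\bigr]\, P_t^{(\ell)} + \mathbb{E}\bigl[Z_1^{(\ell)}(s) \mid \mathcal{N}_{\vec{e}}^{(\ell)} \text{ not a tree}\bigr]\, P_{\bar{t}}^{(\ell)}$$

where $P_t^{(\ell)} \triangleq 1 - P_{\bar{t}}^{(\ell)}$. As shown in Theorem 12, $P_{\bar{t}}^{(\ell)} \le \frac{\gamma}{n}$ where $\gamma$ is a positive constant independent of n. Furthermore, we
have $\mathbb{E}\bigl[Z_1^{(\ell)}(s) \mid \text{neighborhood is a tree}\bigr] = p^{(\ell)}(s)$, so

$$\mathbb{E}[Z_1^{(\ell)}(s)] \le \bigl(1 - P_{\bar{t}}^{(\ell)}\bigr)\, p^{(\ell)}(s) + P_{\bar{t}}^{(\ell)} \le p^{(\ell)}(s) + P_{\bar{t}}^{(\ell)}$$
$$\mathbb{E}[Z_1^{(\ell)}(s)] \ge \bigl(1 - P_{\bar{t}}^{(\ell)}\bigr)\, p^{(\ell)}(s) \ge p^{(\ell)}(s) - P_{\bar{t}}^{(\ell)}. \qquad (2.177)$$

Using (2.176), (2.177) and $P_{\bar{t}}^{(\ell)} \le \frac{\gamma}{n}$ gives that

$$\left|\frac{\mathbb{E}[Z^{(\ell)}(s)]}{n d_v} - p^{(\ell)}(s)\right| \le P_{\bar{t}}^{(\ell)} \le \frac{\gamma}{n}.$$

Hence, if $n > \frac{2\gamma}{\varepsilon}$, then (2.172) holds. □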



The concentration result proved above is a generalization of the results given in [36] for a binary-input
output-symmetric memoryless channel. One can degenerate the expression of $\frac{1}{\beta}$ in (2.175) to the
memoryless case by setting $W = 0$ and $I = 0$. Since exact expressions for $N_e^{(\ell)}$ and $N_Y^{(\ell)}$ are used
in the above proof, one can expect a tighter bound as compared to the earlier result $\frac{1}{\beta_{\mathrm{old}}} = 544\, d_v^{2\ell-1} d_c^{2\ell}$
given in [36]. For example, for $(d_v, d_c, \ell) = (3, 4, 10)$, one gets an improvement by a factor of about
1 million. However, even with this improved expression, the required size of n according to our proof
can be absurdly large. This is because the proof is very pessimistic in the sense that it assumes that any
change in an edge or the decoder's input introduces an error in every message it affects. This is especially
pessimistic if a large $\ell$ is considered, since as $\ell$ is increased, each message is a function of many edges and
received output symbols from the channel (since the neighborhood grows with $\ell$).
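To make the comparison concrete, the following Python sketch evaluates $N_e^{(\ell)}$, $N_Y^{(\ell)}$ and $\frac{1}{\beta}$ from (2.175) in the memoryless case ($W = 0$), alongside the expression $544\, d_v^{2\ell-1} d_c^{2\ell}$ quoted from [36]. The function names are hypothetical and the snippet only restates the formulas of this subsection.

```python
def neighborhood_sizes(d_v: int, d_c: int, ell: int, W: int = 0):
    """Return (N_e, N_Y) for depth ell, following the counting in the proof above."""
    alpha = (d_v - 1 + 2 * W * d_v) * (d_c - 1)           # per-level expansion factor
    geometric_sum = sum(alpha ** i for i in range(ell))   # sum_{i=0}^{ell-1} alpha^i
    n_e = 1 + d_c * (d_v - 1 + 2 * W * d_v) * geometric_sum
    n_y = (2 * W + 1) * d_v * geometric_sum
    return n_e, n_y

def inv_beta(d_v: int, d_c: int, ell: int, W: int = 0) -> float:
    """1/beta as given in (2.175)."""
    n_e, n_y = neighborhood_sizes(d_v, d_c, ell, W)
    return 8 * (4 * d_v * n_e ** 2 + n_y ** 2) / d_v ** 2

def inv_beta_old(d_v: int, d_c: int, ell: int) -> int:
    """1/beta quoted from [36] for memoryless channels."""
    return 544 * d_v ** (2 * ell - 1) * d_c ** (2 * ell)

if __name__ == "__main__":
    d_v, d_c, ell = 3, 4, 10
    new, old = inv_beta(d_v, d_c, ell), inv_beta_old(d_v, d_c, ell)
    print(f"1/beta from (2.175): {new:.3e}")
    print(f"1/beta from [36]:    {old:.3e}")
    print(f"improvement factor:  {old / new:.3e}")
```

Both constants remain astronomically large for this example, which is consistent with the remark that the block length required by the proof can be absurdly large.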

The same concentration of measure phenomena that are proved above for regular LDPC code
ensembles can be extended to irregular LDPC code ensembles. In the special case of memoryless binary-input
output-symmetric channels, the following theorem was proved by Richardson and Urbanke in [13,
pp. 487-490], based on the Azuma-Hoeffding inequality (we use here the same notation for LDPC code
ensembles as in the preceding subsection).

Theorem 14. [Concentration of the bit error probability around the ensemble average] Let
$\mathcal{C}$, a code chosen uniformly at random from the ensemble $\mathrm{LDPC}(n, \lambda, \rho)$, be used for transmission over
a memoryless binary-input output-symmetric (MBIOS) channel characterized by its L-density $a_{\mathrm{MBIOS}}$.
Assume that the decoder performs $\ell$ iterations of message-passing decoding, and let $P_{\mathrm{b}}(\mathcal{C}, a_{\mathrm{MBIOS}}, \ell)$
denote the resulting bit error probability. Then, for every $\delta > 0$, there exists an $\alpha > 0$, where
$\alpha = \alpha(\lambda, \rho, \delta, \ell)$ (independent of the block length n), such that

$$\mathbb{P}\bigl(|P_{\mathrm{b}}(\mathcal{C}, a_{\mathrm{MBIOS}}, \ell) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[P_{\mathrm{b}}(\mathcal{C}, a_{\mathrm{MBIOS}}, \ell)]| \ge \delta\bigr) \le \exp(-\alpha n).$$

This theorem asserts that all except an exponentially (in the block length) small fraction of codes
behave within an arbitrarily small $\delta$ from the ensemble average (where $\delta$ is a positive number that can be
chosen arbitrarily small). Therefore, assuming a sufficiently large block length, the ensemble average is
a good indicator for the performance of individual codes, and it is therefore reasonable to focus on the
design and analysis of capacity-approaching ensembles (via the density evolution technique). This forms
a central result in the theory of codes defined on graphs and iterative decoding algorithms.

2.6.5 On the concentration of the conditional entropy for LDPC code ensembles 

A large-deviations analysis of the conditional entropy for random ensembles of LDPC codes was introduced
in [109, Theorem 4] and [30, Theorem 1]. The following theorem is proved in [109, Appendix I], based
on the Azuma-Hoeffding inequality, and it is rephrased in the following to consider small deviations of
order $\sqrt{n}$ (instead of large deviations of order n):

Theorem 15. [Concentration of the conditional entropy] Let $\mathcal{C}$ be chosen uniformly at random from
the ensemble $\mathrm{LDPC}(n, \lambda, \rho)$. Assume that the transmission of the code $\mathcal{C}$ takes place over a memoryless
binary-input output-symmetric (MBIOS) channel. Let $H(\mathbf{X}|\mathbf{Y})$ designate the conditional entropy of the
transmitted codeword $\mathbf{X}$ given the received sequence $\mathbf{Y}$ from the channel. Then, for any $\xi > 0$,

$$\mathbb{P}\bigl(|H(\mathbf{X}|\mathbf{Y}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[H(\mathbf{X}|\mathbf{Y})]| \ge \xi \sqrt{n}\bigr) \le 2 \exp(-B \xi^2)$$

where $B \triangleq \frac{1}{2 (d_c^{\max} + 1)^2 (1 - R_{\mathrm{d}})}$, $d_c^{\max}$ is the maximal check-node degree, and $R_{\mathrm{d}}$ is the design rate of the
ensemble.

The conditional entropy scales linearly with n, whereas this inequality considers deviations from the
average that scale like $\sqrt{n}$.
In the following, we revisit the proof of Theorem 15 in [109, Appendix I] in order to derive a tightened
version of this bound. Based on this proof, let $\mathcal{G}$ be a bipartite graph which represents a code chosen
uniformly at random from the ensemble $\mathrm{LDPC}(n, \lambda, \rho)$. Define the RV

$$Z = H_{\mathcal{G}}(\mathbf{X}|\mathbf{Y})$$

which forms the conditional entropy when the transmission takes place over an MBIOS channel whose
transition probability is given by $P_{\mathbf{Y}|\mathbf{X}}(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{n} p_{Y|X}(y_i|x_i)$ where $p_{Y|X}(y|1) = p_{Y|X}(-y|0)$. Fix an
arbitrary order for the $m = n(1 - R_{\mathrm{d}})$ parity-check nodes, where $R_{\mathrm{d}}$ forms the design rate of the LDPC
code ensemble. Let $\{\mathcal{F}_t\}_{t \in \{0, 1, \ldots, m\}}$ form a filtration of $\sigma$-algebras $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_m$ where $\mathcal{F}_t$ (for
$t = 0, 1, \ldots, m$) is the $\sigma$-algebra that is generated by all the subsets of $m \times n$ parity-check matrices that
are characterized by the pair of degree distributions $(\lambda, \rho)$ and whose first t parity-check equations are
fixed (for $t = 0$ nothing is fixed, and therefore $\mathcal{F}_0 = \{\emptyset, \Omega\}$ where $\emptyset$ denotes the empty set, and $\Omega$ is the
whole sample space of $m \times n$ binary parity-check matrices that are characterized by the pair of degree
distributions $(\lambda, \rho)$). Accordingly, based on Remarks 2 and 3, let us define the following martingale
sequence

$$Z_t = \mathbb{E}[Z \mid \mathcal{F}_t], \qquad t \in \{0, 1, \ldots, m\}.$$
By construction, $Z_0 = \mathbb{E}[H_{\mathcal{G}}(\mathbf{X}|\mathbf{Y})]$ is the expected value of the conditional entropy for the LDPC code
ensemble, and $Z_m$ is the RV that is equal (a.s.) to the conditional entropy of the particular code from
the ensemble (see Remark 3). Similarly to [109, Appendix I], we obtain upper bounds on the differences
$|Z_{t+1} - Z_t|$ and then rely on Azuma's inequality in Theorem 1.

Without loss of generality, the parity-checks are ordered in [109, Appendix I] by increasing degree.
Let $\mathbf{r} = (r_1, r_2, \ldots)$ be the set of parity-check degrees in ascending order, and $\Gamma_i$ be the fraction of
parity-check nodes of degree i. Hence, the first $m_1 = n(1 - R_{\mathrm{d}})\Gamma_{r_1}$ parity-check nodes are of degree $r_1$, the
successive $m_2 = n(1 - R_{\mathrm{d}})\Gamma_{r_2}$ parity-check nodes are of degree $r_2$, and so on. The $(t+1)$-th parity-check
will therefore have a well-defined degree, to be denoted by r. From the proof in [109, Appendix I],

$$|Z_{t+1} - Z_t| \le (r + 1)\, H_{\mathcal{G}}(\tilde{X}|\mathbf{Y}) \qquad (2.178)$$

where $H_{\mathcal{G}}(\tilde{X}|\mathbf{Y})$ is a RV which designates the conditional entropy of a parity-bit $\tilde{X} = X_{i_1} \oplus \ldots \oplus X_{i_r}$ (i.e.,
$\tilde{X}$ is equal to the modulo-2 sum of some r bits in the codeword $\mathbf{X}$) given the received sequence $\mathbf{Y}$ at the
channel output. The proof in [109, Appendix I] was then completed by upper bounding the parity-check
degree r by the maximal parity-check degree $d_c^{\max}$, and also by upper bounding the conditional entropy
of the parity-bit $\tilde{X}$ by 1. This gives

$$|Z_{t+1} - Z_t| \le d_c^{\max} + 1, \qquad t = 0, 1, \ldots, m - 1 \qquad (2.179)$$

which then proves Theorem 15 from Azuma's inequality. Note that the $d_k$'s in Theorem 1 are equal to
$d_c^{\max} + 1$, and n in Theorem 1 is replaced with the length $m = n(1 - R_{\mathrm{d}})$ of the martingale sequence $\{Z_t\}$
(that is equal to the number of the parity-check nodes in the graph).
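The step from (2.179) to Theorem 15 can be checked numerically. The sketch below assumes Azuma's inequality in the form $\mathbb{P}(|X_m - X_0| \ge r) \le 2\exp\bigl(-\frac{r^2}{2\sum_k d_k^2}\bigr)$, and the function names and parameter values are hypothetical. Plugging in $m = n(1 - R_{\mathrm{d}})$ differences of size $d_c^{\max} + 1$ and a deviation $r = \xi\sqrt{n}$ should reproduce $2\exp(-B\xi^2)$.

```python
import math

def azuma_two_sided_bound(r: float, diffs) -> float:
    """Azuma's bound 2*exp(-r^2 / (2*sum_k d_k^2)) on P(|X_m - X_0| >= r)."""
    return 2.0 * math.exp(-r ** 2 / (2.0 * sum(d * d for d in diffs)))

def b_theorem15(d_c_max: int, design_rate: float) -> float:
    """B = 1 / (2 (d_c_max + 1)^2 (1 - R_d)) from Theorem 15."""
    return 1.0 / (2.0 * (d_c_max + 1) ** 2 * (1.0 - design_rate))

if __name__ == "__main__":
    n, d_c_max, design_rate, xi = 10_000, 20, 0.9, 5.0   # illustrative values only
    m = round(n * (1.0 - design_rate))                   # number of parity-check nodes
    direct = azuma_two_sided_bound(xi * math.sqrt(n), [d_c_max + 1] * m)
    closed_form = 2.0 * math.exp(-b_theorem15(d_c_max, design_rate) * xi ** 2)
    print(direct, closed_form)
```

Because `azuma_two_sided_bound` accepts an arbitrary list of differences, the same helper also covers the case where the bounded differences vary from step to step.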

In the continuation, we deviate from the proof in [109, Appendix I] in two respects:

The first difference is related to the upper bound on the conditional entropy $H_{\mathcal{G}}(\tilde{X}|\mathbf{Y})$ in (2.178), where $\tilde{X}$
is the modulo-2 sum of some r bits of the transmitted codeword $\mathbf{X}$ given the channel output $\mathbf{Y}$. Instead
of taking the most trivial upper bound that is equal to 1, as was done in [109, Appendix I], a simple
upper bound on the conditional entropy is derived; this bound depends on the parity-check degree r and
the channel capacity C (see Proposition 3).

The second difference is minor, but it proves to be helpful for tightening the concentration inequality
for LDPC code ensembles that are not right-regular (i.e., the case where the degrees of the parity-check
nodes are not fixed to a certain value). Instead of upper bounding the term $r + 1$ on the right-hand side
of (2.178) with $d_c^{\max} + 1$, it is suggested to leave it as is since Azuma's inequality applies to the case where
the bounded differences of the martingale sequence are not fixed (see Theorem 1), and since the number
of the parity-check nodes of degree r is equal to $n(1 - R_{\mathrm{d}})\Gamma_r$. The effect of this simple modification will
be shown in Example 14.

The following upper bound is related to the first item above:

Proposition 3. Let $\mathcal{G}$ be a bipartite graph which corresponds to a binary linear block code whose
transmission takes place over an MBIOS channel. Let $\mathbf{X}$ and $\mathbf{Y}$ designate the transmitted codeword and
received sequence at the channel output. Let $\tilde{X} = X_{i_1} \oplus \ldots \oplus X_{i_r}$ be a parity-bit of some r code bits of
$\mathbf{X}$. Then, the conditional entropy of $\tilde{X}$ given $\mathbf{Y}$ satisfies

$$H_{\mathcal{G}}(\tilde{X}|\mathbf{Y}) \le h_2\left(\frac{1 - C^{\frac{r}{2}}}{2}\right). \qquad (2.180)$$

Further, for a binary symmetric channel (BSC) or a binary erasure channel (BEC), this bound can be
improved to

$$h_2\left(\frac{1 - \bigl[1 - 2 h_2^{-1}(1 - C)\bigr]^r}{2}\right) \qquad (2.181)$$

and

$$1 - C^r \qquad (2.182)$$

respectively, where $h_2^{-1}$ in (2.181) designates the inverse of the binary entropy function to the base 2.

Note that if the MBIOS channel is perfect (i.e., its capacity is $C = 1$ bit per channel use) then (2.180)
holds with equality (where both sides of (2.180) are zero), whereas the trivial upper bound is 1.
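The three bounds of Proposition 3 are easy to evaluate numerically. The following Python sketch does so under the assumption that $h_2^{-1}$ is taken on $[0, \frac{1}{2}]$ and computed by bisection; the function names are hypothetical.

```python
import math

def h2(x: float) -> float:
    """Binary entropy function to the base 2, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def h2_inv(y: float, tol: float = 1e-12) -> float:
    """Inverse of h2 restricted to [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if h2(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def bound_mbios(r: int, capacity: float) -> float:
    """Right-hand side of (2.180) for a general MBIOS channel."""
    return h2((1.0 - capacity ** (r / 2.0)) / 2.0)

def bound_bsc(r: int, capacity: float) -> float:
    """Tightened bound (2.181) for the BSC."""
    crossover = h2_inv(1.0 - capacity)
    return h2((1.0 - (1.0 - 2.0 * crossover) ** r) / 2.0)

def bound_bec(r: int, capacity: float) -> float:
    """Tightened bound (2.182) for the BEC."""
    return 1.0 - capacity ** r

if __name__ == "__main__":
    r, capacity = 20, 0.98
    print(bound_mbios(r, capacity), bound_bsc(r, capacity), bound_bec(r, capacity))
```

For a perfect channel (`capacity = 1.0`) all three functions return zero, in line with the note above, whereas the trivial bound equals 1.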

Proof. Since conditioning reduces the entropy, we have $H(\tilde{X}|\mathbf{Y}) \le H(\tilde{X}|Y_{i_1}, \ldots, Y_{i_r})$. Note that $Y_{i_1}, \ldots, Y_{i_r}$
are the corresponding channel outputs to the channel inputs $X_{i_1}, \ldots, X_{i_r}$, where these r bits are used to
calculate the parity-bit $\tilde{X}$. Hence, by combining the last inequality with [103, Eq. (17) and Appendix I],
it follows that

$$H(\tilde{X}|\mathbf{Y}) \le 1 - \frac{1}{2 \ln 2} \sum_{p=1}^{\infty} \frac{(g_p)^r}{p(2p-1)} \qquad (2.183)$$

where (see [103, Eq. (19)])

$$g_p \triangleq \int_0^{\infty} a(l)\,(1 + e^{-l}) \tanh^{2p}\Bigl(\frac{l}{2}\Bigr)\, dl, \qquad \forall\, p \in \mathbb{N} \qquad (2.184)$$

and $a(\cdot)$ denotes the symmetric pdf of the log-likelihood ratio at the output of the MBIOS channel, given
that the channel input is equal to zero. From [103, Lemmas 4 and 5], it follows that

$$g_p \ge C^p, \qquad \forall\, p \in \mathbb{N}.$$

Substituting this inequality in (2.183) gives that

$$H(\tilde{X}|\mathbf{Y}) \le 1 - \frac{1}{2 \ln 2} \sum_{p=1}^{\infty} \frac{C^{pr}}{p(2p-1)} = h_2\left(\frac{1 - C^{\frac{r}{2}}}{2}\right) \qquad (2.185)$$

where the last equality follows from the power series expansion of the binary entropy function:

$$h_2(x) = 1 - \frac{1}{2 \ln 2} \sum_{p=1}^{\infty} \frac{(1 - 2x)^{2p}}{p(2p-1)}, \qquad 0 \le x \le 1. \qquad (2.186)$$

This proves the result in (2.180).

The tightened bound on the conditional entropy for the BSC is obtained from (2.183) and the equality

$$g_p = \bigl(1 - 2 h_2^{-1}(1 - C)\bigr)^{2p}, \qquad \forall\, p \in \mathbb{N}$$

which holds for the BSC (see [103, Eq. (97)]). This replaces C on the right-hand side of (2.185) with
$\bigl(1 - 2 h_2^{-1}(1 - C)\bigr)^2$, thus leading to the tightened bound in (2.181).

The tightened result for the BEC follows from (2.183) where, from (2.184),

$$g_p = C, \qquad \forall\, p \in \mathbb{N}$$

(see [103, Appendix II]). Substituting $g_p$ into the right-hand side of (2.183) gives (2.182) (note that
$\sum_{p=1}^{\infty} \frac{1}{p(2p-1)} = 2 \ln 2$). This completes the proof of Proposition 3. □
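The two series identities used in the proof, namely the expansion (2.186) and $\sum_{p \ge 1} \frac{1}{p(2p-1)} = 2\ln 2$, can be sanity-checked with truncated sums. The sketch below is only a numerical check with hypothetical function names; the truncation lengths are arbitrary, and the convergence of (2.186) slows down as x approaches 0 or 1.

```python
import math

def h2(x: float) -> float:
    """Binary entropy function to the base 2, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def h2_series(x: float, terms: int = 200) -> float:
    """Truncated power series (2.186) for the binary entropy function."""
    partial = sum((1.0 - 2.0 * x) ** (2 * p) / (p * (2 * p - 1))
                  for p in range(1, terms + 1))
    return 1.0 - partial / (2.0 * math.log(2.0))

def series_constant(terms: int) -> float:
    """Partial sum of sum_{p>=1} 1 / (p (2p - 1)), which converges to 2 ln 2."""
    return sum(1.0 / (p * (2 * p - 1)) for p in range(1, terms + 1))

if __name__ == "__main__":
    for x in (0.05, 0.1, 0.25, 0.4):
        print(x, h2(x), h2_series(x))
    print(series_constant(20000), 2.0 * math.log(2.0))
```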
From Proposition 3 and (2.178),

$$|Z_{t+1} - Z_t| \le (r + 1)\, h_2\left(\frac{1 - C^{\frac{r}{2}}}{2}\right) \qquad (2.187)$$

with the corresponding two improvements for the BSC and BEC (where the second term on the
right-hand side of (2.187) is replaced by (2.181) and (2.182), respectively). This improves the loosened bound
of $(d_c^{\max} + 1)$ in [109, Appendix I]. From (2.187) and Theorem 1, we obtain the following tightened version
of the concentration inequality in Theorem 15.



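Theorem 16. [A tightened concentration inequality for the conditional entropy] Let $\mathcal{C}$ be
chosen uniformly at random from the ensemble $\mathrm{LDPC}(n, \lambda, \rho)$. Assume that the transmission of the
code $\mathcal{C}$ takes place over a memoryless binary-input output-symmetric (MBIOS) channel. Let $H(\mathbf{X}|\mathbf{Y})$
designate the conditional entropy of the transmitted codeword $\mathbf{X}$ given the received sequence $\mathbf{Y}$ at the
channel output. Then, for every $\xi > 0$,

$$\mathbb{P}\bigl(|H(\mathbf{X}|\mathbf{Y}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[H(\mathbf{X}|\mathbf{Y})]| \ge \xi \sqrt{n}\bigr) \le 2 \exp(-B \xi^2) \qquad (2.188)$$

where

$$B \triangleq \frac{1}{2 (1 - R_{\mathrm{d}}) \sum_{i=1}^{d_c^{\max}} \left\{ (i+1)^2\, \Gamma_i \left[ h_2\left( \frac{1 - C^{\frac{i}{2}}}{2} \right) \right]^2 \right\}} \qquad (2.189)$$

and $d_c^{\max}$ is the maximal check-node degree, $R_{\mathrm{d}}$ is the design rate of the ensemble, and C is the channel
capacity (in bits per channel use). Furthermore, for a binary symmetric channel (BSC) or a binary erasure
channel (BEC), the parameter B on the right-hand side of (2.188) can be improved (i.e., increased),
respectively, to

$$B \triangleq \frac{1}{2 (1 - R_{\mathrm{d}}) \sum_{i=1}^{d_c^{\max}} \left\{ (i+1)^2\, \Gamma_i \left[ h_2\left( \frac{1 - [1 - 2 h_2^{-1}(1 - C)]^i}{2} \right) \right]^2 \right\}}$$

and

$$B \triangleq \frac{1}{2 (1 - R_{\mathrm{d}}) \sum_{i=1}^{d_c^{\max}} \left\{ (i+1)^2\, \Gamma_i\, (1 - C^i)^2 \right\}}. \qquad (2.190)$$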



Remark 15. From (2.189), Theorem 16 indeed yields a stronger concentration inequality than Theorem 15.

Remark 16. In the limit where $C \to 1$ bit per channel use, it follows from (2.189) that if $d_c^{\max} < \infty$
then $B \to \infty$. This is in contrast to the value of B in Theorem 15, which does not depend on the
channel capacity and is finite. Note that B should indeed be infinity for a perfect channel, and therefore
Theorem 16 is tight in this case.

In the case where $d_c^{\max}$ is not finite, we prove the following:

Lemma 8. If $d_c^{\max} = \infty$ and $\rho'(1) < \infty$, then $B \to \infty$ in the limit where $C \to 1$.

Proof. See Appendix 2.D. □

This is in contrast to the value of B in Theorem 15, which vanishes when $d_c^{\max} = \infty$, and therefore
Theorem 15 is not informative in this case (see Example 14).
Example 13. [Comparison of Theorems 15 and 16 for right-regular LDPC code ensembles] In the
following, we exemplify the improvement in the tightness of Theorem 16 for right-regular LDPC code
ensembles. Consider the case where the communication takes place over a binary-input additive white
Gaussian noise channel (BIAWGNC) or a BEC. Let us consider the (2, 20) regular LDPC code ensemble
whose design rate is equal to 0.900 bits per channel use. For a BEC, the threshold of the channel bit
erasure probability under belief-propagation (BP) decoding is given by

$$p_{\mathrm{BP}} = \inf_{x \in (0,1]} \frac{x}{1 - (1 - x)^{19}} = 0.0531$$

which corresponds to a channel capacity of $C = 0.9469$ bits per channel use. For the BIAWGNC, the
threshold under BP decoding is equal to $\sigma_{\mathrm{BP}} = 0.4156590$. From [13, Example 4.38], which expresses the
capacity of the BIAWGNC in terms of the standard deviation $\sigma$ of the Gaussian noise, the minimum
capacity of a BIAWGNC over which it is possible to communicate with vanishing bit error probability
under BP decoding is $C = 0.9685$ bits per channel use. Accordingly, let us assume that for reliable
communications on both channels, the capacity of the BEC and BIAWGNC is set to 0.98 bits per
channel use.

Since the considered code ensemble is right-regular (i.e., the parity-check degree is fixed to $d_c = 20$),
B in Theorem 16 is improved by a factor of

$$\frac{1}{\left[ h_2\left( \frac{1 - C^{\frac{d_c}{2}}}{2} \right) \right]^2} = 5.134.$$

This implies that the inequality in Theorem 16 is satisfied with a block length that is 5.134 times shorter
than the block length which corresponds to Theorem 15. For the BEC, the result is improved by a factor
of

$$\frac{1}{\bigl(1 - C^{d_c}\bigr)^2} = 9.051$$

due to the tightened value of B in (2.190) as compared to Theorem 15.

Example 14. [Comparison of Theorems 15 and 16 for a heavy-tail Poisson distribution (Tornado codes)]
In the following, we compare Theorems 15 and 16 for Tornado LDPC code ensembles. This
capacity-achieving sequence for the BEC refers to the heavy-tail Poisson distribution, and it was introduced in
[35, Section IV], [110] (see also [13, Problem 3.20]). We rely in the following on the analysis in [103,
Appendix VI].

Suppose that we wish to design Tornado code ensembles that achieve a fraction $1 - \varepsilon$ of the capacity
of a BEC under iterative message-passing decoding (where $\varepsilon$ can be set arbitrarily small). Let p designate
the bit erasure probability of the channel. The parity-check degree is Poisson distributed, and therefore
the maximal degree of the parity-check nodes is infinity. Hence, $B = 0$ according to Theorem 15, and
this theorem therefore is useless for the considered code ensemble. On the other hand, from Theorem 16,

$$\sum_i (i+1)^2\, \Gamma_i \left[ h_2\left( \frac{1 - C^{\frac{i}{2}}}{2} \right) \right]^2 \overset{(a)}{\le} \sum_i (i+1)^2\, \Gamma_i$$
$$\overset{(b)}{=} \frac{\sum_i \rho_i (i+2)}{\int_0^1 \rho(x)\, dx} + 1$$
$$\overset{(c)}{=} \bigl(\rho'(1) + 3\bigr)\, d_c^{\mathrm{avg}} + 1$$
$$\overset{(d)}{=} \left( \frac{\lambda'(0)\, \rho'(1)}{\lambda_2} + 3 \right) d_c^{\mathrm{avg}} + 1$$
$$\overset{(e)}{\le} \left( \frac{1}{p \lambda_2} + 3 \right) d_c^{\mathrm{avg}} + 1$$
$$\overset{(f)}{=} O\Bigl( \log^2 \frac{1}{\varepsilon} \Bigr)$$

where inequality (a) holds since the binary entropy function to the base 2 is bounded between zero and one,
equality (b) holds since

$$\Gamma_i = \frac{\rho_i / i}{\int_0^1 \rho(x)\, dx}$$

where $\Gamma_i$ and $\rho_i$ denote the fraction of parity-check nodes and the fraction of edges that are connected
to parity-check nodes of degree i, respectively (and also since $\sum_i \Gamma_i = 1$), equality (c) holds since

$$d_c^{\mathrm{avg}} = \frac{1}{\int_0^1 \rho(x)\, dx}$$

where $d_c^{\mathrm{avg}}$ denotes the average parity-check node degree, equality (d) holds since $\lambda'(0) = \lambda_2$, inequality (e)
is due to the stability condition for the BEC (where $p \lambda'(0) \rho'(1) < 1$ is a necessary condition for reliable
communication on the BEC under BP decoding), and finally equality (f) follows from the analysis in
[103, Appendix VI] (an upper bound on $\frac{1}{\lambda_2}$ is derived in [103, Eq. (120)], and the average parity-check
node degree scales like $\log\frac{1}{\varepsilon}$). Hence, from the above chain of inequalities and (2.189), it follows that for
a small gap to capacity, the parameter B in Theorem 16 scales (at least) like

$$O\left( \frac{1}{\log^2 \frac{1}{\varepsilon}} \right).$$

Theorem 16 is therefore useful for the analysis of this LDPC code ensemble. As is shown above, the
parameter B in (2.189) tends to zero rather slowly as we let the fractional gap $\varepsilon$ tend to zero (which
therefore demonstrates a rather fast concentration in Theorem 16).
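Equalities (b) and (c) in the above chain hold for any check-node degree distribution, which can be confirmed with exact rational arithmetic. The Python sketch below uses a small, purely hypothetical edge-perspective distribution $\rho$ (not the Tornado distribution) and hypothetical function names.

```python
from fractions import Fraction

def check_node_fractions(rho):
    """Convert edge-perspective fractions rho_i into node-perspective fractions Gamma_i."""
    integral = sum(r / i for i, r in rho.items())   # int_0^1 rho(x) dx
    return {i: (r / i) / integral for i, r in rho.items()}

def lhs(rho):
    """sum_i (i + 1)^2 Gamma_i, the right-hand side of inequality (a)."""
    gamma = check_node_fractions(rho)
    return sum((i + 1) ** 2 * g for i, g in gamma.items())

def rhs(rho):
    """(rho'(1) + 3) * d_c_avg + 1, the right-hand side of equality (c)."""
    rho_prime_at_1 = sum(r * (i - 1) for i, r in rho.items())
    avg_degree = 1 / sum(r / i for i, r in rho.items())
    return (rho_prime_at_1 + 3) * avg_degree + 1

if __name__ == "__main__":
    # Hypothetical example: fractions of edges attached to check nodes of degree 3, 6, 10.
    rho = {3: Fraction(1, 5), 6: Fraction(1, 2), 10: Fraction(3, 10)}
    print(lhs(rho), rhs(rho))
```

Using `Fraction` avoids floating-point rounding, so the two sides can be compared for exact equality.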



Example 15. This example forms a direct continuation of Example 13 for the $(n, d_v, d_c)$ regular LDPC
code ensembles where $d_v = 2$ and $d_c = 20$. With the settings in this example, Theorem 15 gives that

$$\mathbb{P}\bigl(|H(\mathbf{X}|\mathbf{Y}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[H(\mathbf{X}|\mathbf{Y})]| \ge \xi \sqrt{n}\bigr) \le 2 \exp(-0.0113\, \xi^2), \qquad \forall\, \xi > 0. \qquad (2.191)$$

As was mentioned already in Example 13, the exponential inequalities in Theorem 16 achieve an
improvement in the exponent of Theorem 15 by factors 5.134 and 9.051 for the BIAWGNC and BEC, respectively.
One therefore obtains from the concentration inequalities in Theorem 16 that, for every $\xi > 0$,

$$\mathbb{P}\bigl(|H(\mathbf{X}|\mathbf{Y}) - \mathbb{E}_{\mathrm{LDPC}(n,\lambda,\rho)}[H(\mathbf{X}|\mathbf{Y})]| \ge \xi \sqrt{n}\bigr) \le \begin{cases} 2 \exp(-0.0580\, \xi^2), & \text{(BIAWGNC)} \\ 2 \exp(-0.1023\, \xi^2), & \text{(BEC)} \end{cases} \qquad (2.192)$$
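The constants in Examples 13 and 15 can be reproduced, up to rounding, from the expressions for B in Theorems 15 and 16. The following Python sketch is an illustration with hypothetical function names; the dictionary `check_fractions` maps each check-node degree i to $\Gamma_i$, and the `"bec"` option selects (2.190) instead of (2.189).

```python
import math

def h2(x: float) -> float:
    """Binary entropy function to the base 2, with h2(0) = h2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def b_theorem15(d_c_max: int, design_rate: float) -> float:
    """B of Theorem 15."""
    return 1.0 / (2.0 * (d_c_max + 1) ** 2 * (1.0 - design_rate))

def b_theorem16(check_fractions, design_rate: float, capacity: float,
                channel: str = "mbios") -> float:
    """B of Theorem 16: (2.189) for a general MBIOS channel, (2.190) for the BEC."""
    total = 0.0
    for degree, gamma in check_fractions.items():
        if channel == "bec":
            entropy_bound = 1.0 - capacity ** degree
        else:
            entropy_bound = h2((1.0 - capacity ** (degree / 2.0)) / 2.0)
        total += (degree + 1) ** 2 * gamma * entropy_bound ** 2
    return 1.0 / (2.0 * (1.0 - design_rate) * total)

if __name__ == "__main__":
    gamma = {20: 1.0}                      # right-regular ensemble with d_c = 20
    b15 = b_theorem15(20, 0.9)
    b16 = b_theorem16(gamma, 0.9, 0.98)
    b16_bec = b_theorem16(gamma, 0.9, 0.98, channel="bec")
    print(f"Theorem 15: B = {b15:.4f}")
    print(f"Theorem 16: B = {b16:.4f} (factor {b16 / b15:.3f})")
    print(f"Theorem 16, BEC: B = {b16_bec:.4f} (factor {b16_bec / b15:.3f})")
```

Small differences in the last digit relative to (2.192) can arise because the quoted exponents are products of already rounded values.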



2.6.6 Expansion of random regular bipartite graphs 

Azuma's inequality is useful for analyzing the expansion of random bipartite graphs. The following
theorem was introduced in [37, Theorem 25]. It is stated and proved here slightly more precisely, in the
sense of characterizing the relation between the deviation from the expected value and the exponential
convergence rate of the resulting probability.
Theorem 17. [Expansion of random regular bipartite graphs] Let $\mathcal{G}$ be chosen uniformly at
random from the regular ensemble $\mathrm{LDPC}(n, x^{l-1}, x^{r-1})$. Let $\alpha \in (0, 1)$ and $\delta > 0$ be fixed. Then, with
probability at least $1 - \exp(-\delta n)$, all sets of $\alpha n$ variables in $\mathcal{G}$ have a number of neighbors that is at least

$$n \left[ \frac{l \bigl(1 - (1 - \alpha)^r\bigr)}{r} - \sqrt{2 l \alpha \bigl(h(\alpha) + \delta\bigr)} \right] \qquad (2.193)$$

where h is the binary entropy function to the natural base (i.e., $h(x) = -x \ln(x) - (1 - x) \ln(1 - x)$ for
$x \in [0, 1]$).

Proof. The proof starts by looking at the expected number of neighbors, and then exposing one neighbor 
at a time to bound the probability that the number of neighbors deviates significantly from this mean. 
Note that the expected number of neighbors of $\alpha n$ variable nodes is equal to

$$ \frac{n l \bigl(1 - (1-\alpha)^r\bigr)}{r} $$

since for each of the $nl/r$ check nodes, the probability that it has at least one edge in the subset of $n\alpha$ chosen
variable nodes is $1 - (1-\alpha)^r$. Let us form a martingale sequence to estimate, via Azuma's inequality, the
probability that the actual number of neighbors deviates by a certain amount from this expected value.

Let $\mathcal{V}$ denote the set of $n\alpha$ nodes. This set has $n\alpha l$ outgoing edges. Let us reveal the destination of
each of these edges one at a time. More precisely, let $S_i$ be the RV denoting the check-node socket which
the $i$-th edge is connected to, where $i \in \{1, \ldots, n\alpha l\}$. Let $X(\mathcal{G})$ be a RV which denotes the number of
neighbors of a chosen set of $n\alpha$ variable nodes in a bipartite graph $\mathcal{G}$ from the ensemble, and define for
$i \in \{0, \ldots, n\alpha l\}$

$$ X_i = \mathbb{E}[X(\mathcal{G}) \mid S_1, \ldots, S_{i-1}]. $$

Note that it is a martingale sequence where $X_0 = \mathbb{E}[X(\mathcal{G})]$ and $X_{n\alpha l} = X(\mathcal{G})$. Also, for every
$i \in \{1, \ldots, n\alpha l\}$, we have $|X_i - X_{i-1}| \le 1$ since every time only one check-node socket is revealed, so the
number of neighbors of the chosen set of variable nodes cannot change by more than 1 at every single
time. Thus, by the one-sided Azuma's inequality in Section 2.2.1,

$$ \mathbb{P}\Bigl( \mathbb{E}[X(\mathcal{G})] - X(\mathcal{G}) \ge \lambda \sqrt{l \alpha n} \Bigr) \le \exp\Bigl( -\frac{\lambda^2}{2} \Bigr), \quad \forall \lambda > 0. $$

Since there are $\binom{n}{n\alpha}$ choices for the set $\mathcal{V}$, it follows from the union bound that the event that there exists a set of
size $n\alpha$ whose number of neighbors is less than $\mathbb{E}[X(\mathcal{G})] - \lambda \sqrt{l \alpha n}$ occurs with probability that is at most
$\binom{n}{n\alpha} \exp\bigl(-\frac{\lambda^2}{2}\bigr)$.

Since $\binom{n}{n\alpha} \le e^{n h(\alpha)}$, we get the loosened bound $\exp\bigl(n h(\alpha) - \frac{\lambda^2}{2}\bigr)$. Finally, choosing $\lambda = \sqrt{2n(h(\alpha) + \delta)}$
gives the required result. □
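The guarantee in (2.193) is easy to evaluate numerically. The following is a minimal Python sketch, assuming the bound reads $n\bigl[l(1-(1-\alpha)^r)/r - \sqrt{2l\alpha(h(\alpha)+\delta)}\bigr]$; the function names are illustrative and not part of the original text. It also shows that the bound can be vacuous (negative) for small degrees.

```python
import math


def binary_entropy_nats(a: float) -> float:
    """Binary entropy h(a) to the natural base, with h(0) = h(1) = 0."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log(a) - (1.0 - a) * math.log(1.0 - a)


def expansion_lower_bound(n: int, l: int, r: int, alpha: float, delta: float) -> float:
    """Right-hand side of (2.193): guaranteed number of neighbors of any set
    of alpha*n variable nodes (holds with probability >= 1 - exp(-delta*n))."""
    mean_term = l * (1.0 - (1.0 - alpha) ** r) / r
    deviation = math.sqrt(2.0 * l * alpha * (binary_entropy_nats(alpha) + delta))
    return n * (mean_term - deviation)


if __name__ == "__main__":
    # Hypothetical parameter choices, for illustration only.
    for (l, r, alpha) in [(3, 6, 0.1), (20, 40, 0.01)]:
        value = expansion_lower_bound(1000, l, r, alpha, delta=0.001)
        print(f"l={l}, r={r}, alpha={alpha}: bound = {value:.2f}")
```

A negative value simply means that the Azuma-based argument gives no information for that choice of parameters.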



2.6.7 Concentration of the crest-factor for OFDM signals 

Orthogonal frequency-division multiplexing (OFDM) is a modulation that converts a high-rate data
stream into a number of low-rate streams that are transmitted over parallel narrow-band channels. OFDM
is widely used in several international standards for digital audio and video broadcasting, and for wireless
local area networks. For a textbook providing a survey on OFDM, see e.g. [111, Chapter 19]. One
of the problems of OFDM signals is that the peak amplitude of the signal can be significantly higher
than the average amplitude; for a recent comprehensive tutorial that considers the problem of the high
peak to average power ratio (PAPR) of OFDM signals and some related issues, the reader is referred to
[112]. The high PAPR of OFDM signals makes their transmission sensitive to non-linear devices in the
communication path such as digital to analog converters, mixers and high-power amplifiers. As a result
of this drawback, the symbol error rate increases and the power efficiency of OFDM
signals is reduced as compared to single-carrier systems.

Given an $n$-length codeword $\{X_i\}_{i=0}^{n-1}$, a single OFDM baseband symbol is described by

$$ s(t) = \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} X_i \exp\Bigl( \frac{j 2\pi i t}{T} \Bigr), \quad 0 \le t \le T. \tag{2.194} $$

Let us assume that $X_0, \ldots, X_{n-1}$ are complex RVs, and that a.s. $|X_i| = 1$ (these RVs are not
necessarily independent). Since the sub-carriers are orthonormal over $[0,T]$, the signal power over
the interval $[0,T]$ is 1 a.s., i.e.,

$$ \frac{1}{T} \int_0^T |s(t)|^2 \, dt = 1. \tag{2.195} $$

The CF of the signal $s$, composed of $n$ sub-carriers, is defined as

$$ \mathrm{CF}_n(s) \triangleq \max_{0 \le t \le T} |s(t)|. \tag{2.196} $$

Commonly, the impact of nonlinearities is described by the distribution of the crest-factor (CF) of the
transmitted signal [113], but its calculation involves time-consuming simulations even for a small number
of sub-carriers. From [114, Section 4] and [115], it follows that the CF scales with high probability like
$\sqrt{\ln n}$ for large $n$. In [113, Theorem 3 and Corollary 5], a concentration inequality was derived for the
CF of OFDM signals. It states that for an arbitrary $c \ge 2.5$

$$ \mathbb{P}\biggl( \Bigl| \mathrm{CF}_n(s) - \sqrt{\ln n} \Bigr| < \frac{c \ln \ln n}{\sqrt{\ln n}} \biggr) = 1 - O\biggl( \frac{1}{(\ln n)^4} \biggr). $$

Remark 17. The analysis used to derive this rather strong concentration inequality (see [113, Appendix C])
requires some assumptions on the distribution of the $X_i$'s (see the two conditions in [113,
Theorem 3] followed by [113, Corollary 5]). These requirements are not needed in the following analysis,
and the derivation of the concentration inequalities introduced in this subsection is much
simpler and provides some insight into the problem, though it leads to weaker concentration results than in
[113, Theorem 3].

In the following, Azuma's inequality and a refined version of this inequality are considered under the
assumption that $\{X_j\}_{j=0}^{n-1}$ are independent complex-valued random variables with magnitude 1, attaining
the $M$ points of an $M$-ary PSK constellation with equal probability.
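To get a feeling for the quantity whose concentration is studied here, the crest factor in (2.196) can be approximated by sampling $s(t)$ on a fine grid. The sketch below is only an illustration under stated assumptions: the oversampling factor, the function names and the use of a uniform time grid are choices made for this example, and a grid maximum only approximates the continuous-time maximum from below.

```python
import cmath
import math
import random


def random_psk(n, M, rng):
    """Draw n i.i.d. symbols, uniform over the M points of an M-ary PSK constellation."""
    return [cmath.exp(2j * math.pi * rng.randrange(M) / M) for _ in range(n)]


def crest_factor(symbols, oversampling=8):
    """Approximate CF_n(s) = max_t |s(t)| by evaluating s(t) in (2.194)
    on a uniform grid of oversampling * n points in [0, T)."""
    n = len(symbols)
    grid = oversampling * n
    peak = 0.0
    for k in range(grid):
        phase = 2.0 * math.pi * k / grid  # equals 2*pi*t/T at t = k*T/grid
        s = sum(x * cmath.exp(1j * phase * i) for i, x in enumerate(symbols))
        peak = max(peak, abs(s) / math.sqrt(n))
    return peak


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed, for reproducibility of the illustration
    n = 64
    samples = [crest_factor(random_psk(n, 4, rng)) for _ in range(20)]
    print("sqrt(ln n) =", math.sqrt(math.log(n)))
    print("sample mean of CF:", sum(samples) / len(samples))
```

Since the average power in (2.195) equals 1 and $|s(t)| \le \sqrt{n}$, every value returned by `crest_factor` lies between 1 and $\sqrt{n}$ (up to rounding).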

Establishing concentration of the crest-factor via Azuma's inequality 

In the following, Azuma's inequality is used to derive a concentration result. Let us define

$$ Y_i = \mathbb{E}[\mathrm{CF}_n(s) \mid X_0, \ldots, X_{i-1}], \quad i = 0, \ldots, n. \tag{2.197} $$

Based on a standard construction of martingales, $\{Y_i, \mathcal{F}_i\}_{i=0}^{n}$ is a martingale where $\mathcal{F}_i$ is the $\sigma$-algebra
that is generated by the first $i$ symbols $(X_0, \ldots, X_{i-1})$ in (2.194). Hence, $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_n$ is a
filtration. This martingale also has bounded jumps, and

$$ |Y_i - Y_{i-1}| \le \frac{2}{\sqrt{n}} $$

for $i \in \{1, \ldots, n\}$ since revealing the additional $i$-th coordinate $X_i$ affects the CF, as defined in (2.196),
by at most $\frac{2}{\sqrt{n}}$ (see the first part of Appendix 2.E). It therefore follows from Azuma's inequality that, for
every $\alpha > 0$,

$$ \mathbb{P}\bigl( |\mathrm{CF}_n(s) - \mathbb{E}[\mathrm{CF}_n(s)]| \ge \alpha \bigr) \le 2 \exp\Bigl( -\frac{\alpha^2}{8} \Bigr) \tag{2.198} $$

which demonstrates concentration around the expected value.

Establishing concentration of the crest-factor via the refined version of Azuma's inequality
in Proposition 1

In the following, we rely on Proposition 1 to derive an improved concentration result. For the martingale
sequence $\{Y_i\}_{i=0}^{n}$ in (2.197), Appendix 2.E gives that a.s.

$$ |Y_i - Y_{i-1}| \le \frac{2}{\sqrt{n}}, \qquad \mathbb{E}\bigl[ (Y_i - Y_{i-1})^2 \mid \mathcal{F}_{i-1} \bigr] \le \frac{2}{n} \tag{2.199} $$

for every $i \in \{1, \ldots, n\}$. Note that the conditioning on the $\sigma$-algebra $\mathcal{F}_{i-1}$ is equivalent to the conditioning
on the symbols $X_0, \ldots, X_{i-2}$, and there is no conditioning for $i = 1$. Further, let $Z_i = \sqrt{n}\, Y_i$ for $0 \le i \le n$.
Proposition 1 therefore implies that for an arbitrary $\alpha > 0$

$$
\begin{aligned}
\mathbb{P}\bigl( |\mathrm{CF}_n(s) - \mathbb{E}[\mathrm{CF}_n(s)]| \ge \alpha \bigr)
&= \mathbb{P}\bigl( |Y_n - Y_0| \ge \alpha \bigr) \\
&= \mathbb{P}\bigl( |Z_n - Z_0| \ge \alpha \sqrt{n} \bigr) \\
&\le 2 \exp\biggl( -\frac{\alpha^2}{4} \Bigl( 1 + O\Bigl( \frac{1}{\sqrt{n}} \Bigr) \Bigr) \biggr)
\end{aligned}
\tag{2.200}
$$

(since $\delta = \frac{\alpha}{2}$ and $\gamma = \frac{1}{2}$ in the setting of Proposition 1). Note that the exponent in the last inequality
is doubled as compared to the bound that was obtained in (2.198) via Azuma's inequality, and the
term which scales like $O\bigl(\frac{1}{\sqrt{n}}\bigr)$ on the right-hand side of (2.200) is expressed explicitly for finite $n$ (see
Appendix 2.A).



A concentration inequality via Talagrand's method 

In his seminal paper [7], Talagrand introduced an approach for proving concentration inequalities in
product spaces. It forms a powerful probabilistic tool for establishing concentration results for coordinate-wise
Lipschitz functions of independent random variables (see, e.g., [HH Section 2.4.2], [6, Section 4] and
[7]). This approach is used in the following to derive a concentration result of the crest factor around
its median, and it also makes it possible to derive an upper bound on the distance between the median and the
expected value. We first provide the definitions that are required for introducing a special
form of Talagrand's inequalities. Afterwards, this inequality is applied to obtain a concentration
result for the crest factor of OFDM signals.

Definition 3 (Hamming distance). Let $x, y$ be two $n$-length vectors. The Hamming distance between $x$
and $y$ is the number of coordinates where $x$ and $y$ disagree, i.e.,

$$ d_H(x, y) \triangleq \sum_{i=1}^{n} I_{\{x_i \ne y_i\}} $$

where $I$ stands for the indicator function.

The following suggests a generalization and normalization of the previous distance metric.

Definition 4. Let $a = (a_1, \ldots, a_n) \in \mathbb{R}_+^n$ (i.e., $a$ is a non-negative vector) satisfy $\|a\|^2 = \sum_{i=1}^{n} (a_i)^2 = 1$.
Then, define

$$ d_a(x, y) \triangleq \sum_{i=1}^{n} a_i I_{\{x_i \ne y_i\}}. $$

Hence, $d_H(x, y) = \sqrt{n}\, d_a(x, y)$ for $a = \bigl( \frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}} \bigr)$.
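The two distances in Definitions 3 and 4 can be illustrated by a short Python sketch; the function names below are chosen for this example only. It also checks the relation $d_H = \sqrt{n}\, d_a$ for the uniform weight vector.

```python
import math


def hamming_distance(x, y):
    """Number of coordinates in which x and y disagree (Definition 3)."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)


def weighted_distance(a, x, y):
    """Weighted distance d_a(x, y) of Definition 4; a is assumed to be a
    non-negative vector with unit Euclidean norm."""
    return sum(ai for ai, xi, yi in zip(a, x, y) if xi != yi)


if __name__ == "__main__":
    x = (1, 2, 3, 4)
    y = (1, 0, 3, 0)
    n = len(x)
    uniform = [1.0 / math.sqrt(n)] * n
    print(hamming_distance(x, y), math.sqrt(n) * weighted_distance(uniform, x, y))
```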

The following is a special form of Talagrand's inequalities ([I, Chapter 4] and [?]).

Theorem 18 (Talagrand's inequality). Let the random vector $X = (X_1, \ldots, X_n)$ be a vector of independent
random variables with $X_k$ taking values in a set $A_k$, and let $A \triangleq \prod_{k=1}^{n} A_k$. Let $f : A \to \mathbb{R}$ satisfy
the condition that, for every $x \in A$, there exists a non-negative, normalized $n$-length vector $a = a(x)$
such that

$$ f(x) \le f(y) + \sigma d_a(x, y), \quad \forall y \in A \tag{2.201} $$

for some fixed value $\sigma > 0$. Then, for every $\alpha \ge 0$,

$$ \mathbb{P}\bigl( |f(X) - m| \ge \alpha \bigr) \le 4 \exp\Bigl( -\frac{\alpha^2}{4 \sigma^2} \Bigr) \tag{2.202} $$

where $m$ is the median of $f(X)$ (i.e., $\mathbb{P}(f(X) \le m) \ge \frac{1}{2}$ and $\mathbb{P}(f(X) \ge m) \ge \frac{1}{2}$). The same conclusion in
(2.202) holds if the condition in (2.201) is replaced by

$$ f(y) \le f(x) + \sigma d_a(x, y), \quad \forall y \in A. \tag{2.203} $$



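At this stage, we are ready to apply Talagrand's inequality to prove a concentration inequality for the
crest factor of OFDM signals. As before, let us assume that $X_0, Y_0, \ldots, X_{n-1}, Y_{n-1}$ are i.i.d. bounded
complex RVs, and also assume for simplicity that $|X_i| = |Y_i| = 1$. In order to apply Talagrand's inequality
to prove concentration, note that

$$
\begin{aligned}
& \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{n-1}) \bigr| - \max_{0 \le t \le T} \bigl| s(t; Y_0, \ldots, Y_{n-1}) \bigr| \\
& \le \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{n-1}) - s(t; Y_0, \ldots, Y_{n-1}) \bigr| \\
& \le \frac{1}{\sqrt{n}} \max_{0 \le t \le T} \biggl| \sum_{i=0}^{n-1} (X_i - Y_i) \exp\Bigl( \frac{j 2\pi i t}{T} \Bigr) \biggr| \\
& \le \frac{1}{\sqrt{n}} \sum_{i=0}^{n-1} |X_i - Y_i| \\
& \le \frac{2}{\sqrt{n}} \sum_{i=0}^{n-1} I_{\{X_i \ne Y_i\}} \\
& = 2 d_a(X, Y)
\end{aligned}
$$

where

$$ a \triangleq \Bigl( \frac{1}{\sqrt{n}}, \ldots, \frac{1}{\sqrt{n}} \Bigr) \tag{2.204} $$

is a non-negative unit-vector of length $n$ (note that $a$ in this case is independent of $x$). Hence, Talagrand's
inequality in Theorem 18 implies that, for every $\alpha \ge 0$,

$$ \mathbb{P}\bigl( |\mathrm{CF}_n(s) - m_n| \ge \alpha \bigr) \le 4 \exp\Bigl( -\frac{\alpha^2}{16} \Bigr) \tag{2.205} $$

where $m_n$ is the median of the crest factor for OFDM signals that are composed of $n$ sub-carriers. This
inequality demonstrates the concentration of this measure around its median. As a simple consequence
of (2.205), one obtains the following result.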



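Corollary 6. The median and expected value of the crest factor differ by at most a constant, independently
of the number of sub-carriers $n$.

Proof. By the concentration inequality in (2.205),

$$
\begin{aligned}
\bigl| \mathbb{E}[\mathrm{CF}_n(s)] - m_n \bigr|
&\le \mathbb{E} \bigl| \mathrm{CF}_n(s) - m_n \bigr| \\
&= \int_0^{\infty} \mathbb{P}\bigl( |\mathrm{CF}_n(s) - m_n| \ge \alpha \bigr) \, d\alpha \\
&\le \int_0^{\infty} 4 \exp\Bigl( -\frac{\alpha^2}{16} \Bigr) \, d\alpha \\
&= 8 \sqrt{\pi}.
\end{aligned}
$$

□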

Remark 18. This result applies in general to an arbitrary function $f$ satisfying the condition in (2.201),
where Talagrand's inequality in (2.202) implies that (see, e.g., [6, Lemma 4.6])

$$ \bigl| \mathbb{E}[f(X)] - m \bigr| \le 4 \sigma \sqrt{\pi}. $$
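The constant in this remark comes from integrating the Talagrand tail bound, $\int_0^\infty 4 \exp(-\alpha^2/(4\sigma^2))\, d\alpha = 4\sigma\sqrt{\pi}$. As a sanity check, the following Python sketch evaluates this integral by a midpoint rule; the truncation point and the number of steps are arbitrary choices made for this illustration.

```python
import math


def talagrand_gap_bound(sigma, steps=200_000):
    """Midpoint-rule approximation of the integral of 4*exp(-a^2 / (4 sigma^2))
    over [0, 40*sigma]; the neglected tail is of order exp(-400)."""
    upper = 40.0 * sigma
    h = upper / steps
    total = 0.0
    for k in range(steps):
        a = (k + 0.5) * h
        total += math.exp(-a * a / (4.0 * sigma * sigma))
    return 4.0 * h * total


if __name__ == "__main__":
    # sigma = 2 corresponds to the crest-factor setting of (2.205).
    print(talagrand_gap_bound(2.0), 8.0 * math.sqrt(math.pi))
```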
Establishing concentration via McDiarmid's inequality

McDiarmid's inequality (see Theorem 2) is applied in the following to prove a concentration inequality
for the crest factor of OFDM signals. To this end, let us define

$$
\begin{aligned}
U &\triangleq \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{i-1}, X_i, \ldots, X_{n-1}) \bigr| \\
V &\triangleq \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1}) \bigr|
\end{aligned}
$$

where the two vectors $(X_0, \ldots, X_{i-1}, X_i, \ldots, X_{n-1})$ and $(X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1})$ may only differ in
their $i$-th coordinate. This then implies that

$$
\begin{aligned}
|U - V| &\le \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{i-1}, X_i, \ldots, X_{n-1}) - s(t; X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1}) \bigr| \\
&= \max_{0 \le t \le T} \frac{1}{\sqrt{n}} \biggl| \bigl( X_{i-1} - X'_{i-1} \bigr) \exp\Bigl( \frac{j 2\pi (i-1) t}{T} \Bigr) \biggr| \\
&= \frac{|X_{i-1} - X'_{i-1}|}{\sqrt{n}} \le \frac{2}{\sqrt{n}}
\end{aligned}
$$

where the last inequality holds since $|X_{i-1}| = |X'_{i-1}| = 1$. Hence, McDiarmid's inequality in Theorem 2
implies that, for every $\alpha \ge 0$,

$$ \mathbb{P}\bigl( |\mathrm{CF}_n(s) - \mathbb{E}[\mathrm{CF}_n(s)]| \ge \alpha \bigr) \le 2 \exp\Bigl( -\frac{\alpha^2}{2} \Bigr) \tag{2.206} $$

which demonstrates concentration of this measure around its expected value. By comparing (2.205)
with (2.206), it follows that McDiarmid's inequality provides an improvement in the exponent. The
improvement of McDiarmid's inequality is by a factor of 4 in the exponent as compared to Azuma's
inequality, and by a factor of 2 as compared to the refined version of Azuma's inequality in Proposition 1.

To conclude, this subsection derives four concentration inequalities for the crest-factor (CF) of OFDM
signals under the assumption that the symbols are independent. The first two concentration inequalities
rely on Azuma's inequality and a refined version of it, and the last two concentration inequalities are
based on Talagrand's and McDiarmid's inequalities. Although these concentration results are weaker than
some existing results from the literature (see [113] and [115]), they establish concentration in a rather
simple way and provide some insight into the problem. McDiarmid's inequality improves the exponent
of Azuma's inequality by a factor of 4, and the exponent of the refined version of Azuma's inequality
from Proposition 1 by a factor of 2. Note however that Proposition 1 may be in general tighter than
McDiarmid's inequality (if $\gamma < \frac{1}{4}$ in the setting of Proposition 1). It also follows from Talagrand's method
that the median and expected value of the CF differ by at most a constant, independently of the number
of sub-carriers.
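The four bounds of this subsection can be put side by side numerically. The Python sketch below assumes the bounds read $2e^{-\alpha^2/8}$ (Azuma), $2e^{-\alpha^2/4}$ (refined Azuma, leading term only, ignoring the $O(1/\sqrt{n})$ correction), $4e^{-\alpha^2/16}$ (Talagrand, around the median) and $2e^{-\alpha^2/2}$ (McDiarmid); the function names are illustrative.

```python
import math


def azuma_bound(alpha):
    """Bound (2.198), deviation from the mean."""
    return 2.0 * math.exp(-alpha ** 2 / 8.0)


def refined_azuma_bound_leading(alpha):
    """Leading term of (2.200); the O(1/sqrt(n)) correction is ignored here."""
    return 2.0 * math.exp(-alpha ** 2 / 4.0)


def talagrand_bound(alpha):
    """Bound (2.205), deviation from the median (not from the mean)."""
    return 4.0 * math.exp(-alpha ** 2 / 16.0)


def mcdiarmid_bound(alpha):
    """Bound (2.206), deviation from the mean."""
    return 2.0 * math.exp(-alpha ** 2 / 2.0)


if __name__ == "__main__":
    for alpha in (2.0, 4.0, 6.0):
        print(alpha, azuma_bound(alpha), refined_azuma_bound_leading(alpha),
              talagrand_bound(alpha), mcdiarmid_bound(alpha))
```

Values above 1 are trivial bounds, so the comparison is informative only for moderately large $\alpha$.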

2.6.8 Random coding theorems via martingale inequalities 

The following subsection establishes new error exponents and achievable rates of random coding, for 
channels with and without memory, under maximum-likelihood (ML) decoding. The analysis relies on 
some exponential inequalities for martingales with bounded jumps. The characteristics of these coding 
theorems are exemplified in special cases of interest that include non-linear channels. The material in 
this subsection is based on [40], [41] and [42] (and mainly on the latest improvements of these achievable
rates in [42]).

Random coding theorems address the average error probability of an ensemble of codebooks as a 
function of the code rate R, the block length N, and the channel statistics. It is assumed that the 
codewords are chosen randomly, subject to some possible constraints, and the codebook is known to the 
encoder and decoder. 

Nonlinear effects are typically encountered in wireless communication systems and optical fibers, which 
degrade the quality of the information transmission. In satellite communication systems, the amplifiers 
located on board satellites typically operate at or near the saturation region in order to conserve energy. 
Saturation nonlinearities of amplifiers introduce nonlinear distortion in the transmitted signals. Similarly, 
power amplifiers in mobile terminals are designed to operate in a nonlinear region in order to obtain 
high power efficiency in mobile cellular communications. Gigabit optical fiber communication channels 
typically exhibit linear and nonlinear distortion as a result of non-ideal transmitter, fiber, receiver and 
optical amplifier components. Nonlinear communication channels can be represented by Volterra models
[116, Chapter 14].

Significant degradation in performance may result in the mismatched regime. However, in the fol- 
lowing, it is assumed that both the transmitter and the receiver know the exact probability law of the 
channel. 

We start the presentation by writing explicitly the martingale inequalities that we rely on, derived 
earlier along the derivation of the concentration inequalities in this chapter. 

Martingale inequalities 

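• The first martingale inequality that will be used in the following is given in (2.42). It was used earlier in
this chapter to prove the refinement of the Azuma-Hoeffding inequality in Theorem 5, and it is stated in
the following as a theorem:

Theorem 19. Let $\{X_k, \mathcal{F}_k\}_{k=0}^{n}$, for some $n \in \mathbb{N}$, be a discrete-parameter, real-valued martingale with
bounded jumps. Let

$$ \xi_k \triangleq X_k - X_{k-1}, \quad \forall k \in \{1, \ldots, n\} $$

designate the jumps of the martingale. Assume that, for some constants $d, \sigma > 0$, the following two
requirements

$$ \xi_k \le d, \qquad \mathrm{Var}(\xi_k \mid \mathcal{F}_{k-1}) \le \sigma^2 $$

hold almost surely (a.s.) for every $k \in \{1, \ldots, n\}$. Let $\gamma \triangleq \frac{\sigma^2}{d^2}$. Then, for every $t \ge 0$,

$$ \mathbb{E}\biggl[ \exp\Bigl( t \sum_{k=1}^{n} \xi_k \Bigr) \biggr] \le \biggl( \frac{e^{-\gamma t d} + \gamma e^{t d}}{1 + \gamma} \biggr)^{n}. $$

• The second martingale inequality that will be used in the following is similar to (|2.T3[) (while removing
the assumption that the martingale is conditionally symmetric). It leads to the following theorem:

Theorem 20. Let $\{X_k, \mathcal{F}_k\}_{k=0}^{n}$, for some $n \in \mathbb{N}$, be a discrete-time, real-valued martingale with bounded
jumps. Let

$$ \xi_k \triangleq X_k - X_{k-1}, \quad \forall k \in \{1, \ldots, n\} $$

and let $m \in \mathbb{N}$ be an even number, $d > 0$ be a positive number, and $\{\mu_l\}_{l=2}^{m}$ be a sequence of numbers
such that

$$ \xi_k \le d, \qquad \mathbb{E}\bigl[ (\xi_k)^l \mid \mathcal{F}_{k-1} \bigr] \le \mu_l, \quad \forall l \in \{2, \ldots, m\} $$

holds a.s. for every $k \in \{1, \ldots, n\}$. Furthermore, let

$$ \gamma_l \triangleq \frac{\mu_l}{d^l}, \quad \forall l \in \{2, \ldots, m\}. $$

Then, for every $t \ge 0$,

$$ \mathbb{E}\biggl[ \exp\Bigl( t \sum_{k=1}^{n} \xi_k \Bigr) \biggr] \le \biggl( 1 + \sum_{l=2}^{m-1} \frac{(\gamma_l - \gamma_m)(t d)^l}{l!} + \gamma_m \bigl( e^{t d} - 1 - t d \bigr) \biggr)^{n}. $$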
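The two exponential bounds can be compared on a toy example. The following Python sketch assumes that Theorem 19 reads $\mathbb{E}[\exp(t\sum_k \xi_k)] \le \bigl((e^{-\gamma t d} + \gamma e^{t d})/(1+\gamma)\bigr)^n$ and that Theorem 20 reads $\mathbb{E}[\exp(t\sum_k \xi_k)] \le \bigl(1 + \sum_{l=2}^{m-1} (\gamma_l - \gamma_m)(td)^l/l! + \gamma_m(e^{td} - 1 - td)\bigr)^n$; the function names and the dictionary-based interface are choices made for this illustration. For i.i.d. jumps that are $\pm 1$ with equal probability, the exact moment generating function is $\cosh(t)^n$, which the first bound attains.

```python
import math


def theorem19_bound(t, n, d, sigma2):
    """Assumed form of the bound in Theorem 19, with gamma = sigma^2 / d^2."""
    gamma = sigma2 / d ** 2
    base = (math.exp(-gamma * t * d) + gamma * math.exp(t * d)) / (1.0 + gamma)
    return base ** n


def theorem20_bound(t, n, d, mu):
    """Assumed form of the bound in Theorem 20.

    mu is a dict mapping l -> mu_l for every l in {2, ..., m}, with m even."""
    m = max(mu)
    if m % 2 != 0:
        raise ValueError("m must be even")
    gamma = {l: mu[l] / d ** l for l in range(2, m + 1)}
    x = t * d
    poly = sum((gamma[l] - gamma[m]) * x ** l / math.factorial(l) for l in range(2, m))
    return (1.0 + poly + gamma[m] * (math.exp(x) - 1.0 - x)) ** n


if __name__ == "__main__":
    t, n = 0.7, 5
    print("exact MGF      :", math.cosh(t) ** n)
    print("Theorem 19     :", theorem19_bound(t, n, d=1.0, sigma2=1.0))
    print("Theorem 20, m=2:", theorem20_bound(t, n, 1.0, {2: 1.0}))
    print("Theorem 20, m=4:", theorem20_bound(t, n, 1.0, {2: 1.0, 3: 0.0, 4: 1.0}))
```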



Achievable rates under ML decoding 

The goal of this subsection is to derive achievable rates in the random coding setting under ML decoding.
We first review briefly the analysis in [41] for the derivation of the upper bound on the ML decoding error
probability. This review is necessary in order to make the beginning of the derivation of this bound more
accurate, and to correct along the way some inaccuracies that appear in [41, Section II]. After the first
stage of this analysis, we proceed by improving the resulting error exponents and their corresponding
achievable rates via the application of the martingale inequalities in the previous subsection.

Consider an ensemble of block codes $\mathscr{C}$ of length $N$ and rate $R$. Let $\mathcal{C} \in \mathscr{C}$ be a codebook in the
ensemble. The number of codewords in $\mathcal{C}$ is $M = \lceil \exp(NR) \rceil$. The codewords of a codebook $\mathcal{C}$ are
assumed to be independent, and the symbols in each codeword are assumed to be i.i.d. with an arbitrary
probability distribution $P$. An ML decoding error occurs if, given the transmitted message $m$ and the
received vector $y$, there exists another message $m' \ne m$ such that

$$ \| y - D u_{m'} \|_2 \le \| y - D u_m \|_2. $$

The union bound for an AWGN channel implies that

$$ P_{e|m}(\mathcal{C}) \le \sum_{m' \ne m} Q\biggl( \frac{\| D u_m - D u_{m'} \|_2}{2 \sigma_\nu} \biggr) $$

where the function $Q$ is the complementary Gaussian cumulative distribution function (see (2.11)). By
using the inequality $Q(x) \le \frac{1}{2} \exp\bigl( -\frac{x^2}{2} \bigr)$ for $x \ge 0$, it gives the loosened bound (by also ignoring the
factor of one-half in the bound of $Q$)

$$ P_{e|m}(\mathcal{C}) \le \sum_{m' \ne m} \exp\biggl( -\frac{\| D u_m - D u_{m'} \|_2^2}{8 \sigma_\nu^2} \biggr). $$

At this stage, let us introduce a new parameter $\rho \in [0,1]$, and write

$$ P_{e|m}(\mathcal{C}) \le \sum_{m' \ne m} \exp\biggl( -\frac{\rho \, \| D u_m - D u_{m'} \|_2^2}{8 \sigma_\nu^2} \biggr). $$

Note that at this stage, the introduction of the additional parameter $\rho$ is useless as its optimal value is
$\rho_{\mathrm{opt}} = 1$. The average ML decoding error probability over the code ensemble therefore satisfies

$$ \overline{P}_{e|m} \le \mathbb{E}\Biggl[ \sum_{m' \ne m} \exp\biggl( -\frac{\rho \, \| D u_m - D u_{m'} \|_2^2}{8 \sigma_\nu^2} \biggr) \Biggr] $$

and the average ML decoding error probability over the code ensemble and the transmitted message
satisfies

$$ \overline{P}_e \le (M - 1) \, \mathbb{E}\Biggl[ \exp\biggl( -\frac{\rho \, \| D u - D \tilde{u} \|_2^2}{8 \sigma_\nu^2} \biggr) \Biggr] \tag{2.207} $$

where the expectation is taken over two randomly chosen codewords $u$ and $\tilde{u}$; these codewords are
independent, and their symbols are i.i.d. with a probability distribution $P$.

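Consider a filtration $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_N$ where the sub $\sigma$-algebra $\mathcal{F}_i$ is given by

$$ \mathcal{F}_i \triangleq \sigma(U_1, \tilde{U}_1, \ldots, U_i, \tilde{U}_i), \quad \forall i \in \{1, \ldots, N\} \tag{2.208} $$

for two randomly selected codewords $u = (u_1, \ldots, u_N)$ and $\tilde{u} = (\tilde{u}_1, \ldots, \tilde{u}_N)$ from the codebook; $\mathcal{F}_i$ is
the minimal $\sigma$-algebra that is generated by the first $i$ coordinates of these two codewords. In particular,
let $\mathcal{F}_0 \triangleq \{\emptyset, \Omega\}$ be the trivial $\sigma$-algebra. Furthermore, define the discrete-time martingale $\{X_k, \mathcal{F}_k\}_{k=0}^{N}$
by

$$ X_k = \mathbb{E}\bigl[ \| D u - D \tilde{u} \|_2^2 \mid \mathcal{F}_k \bigr] \tag{2.209} $$

which designates the conditional expectation of the squared Euclidean distance between the distorted codewords
$D u$ and $D \tilde{u}$ given the first $k$ coordinates of the two codewords $u$ and $\tilde{u}$. The first and last entries of this
martingale sequence are, respectively, equal to

$$ X_0 = \mathbb{E}\bigl[ \| D u - D \tilde{u} \|_2^2 \bigr], \qquad X_N = \| D u - D \tilde{u} \|_2^2. \tag{2.210} $$

Furthermore, following earlier notation, let $\xi_k = X_k - X_{k-1}$ be the jumps of the martingale; then

$$ \sum_{k=1}^{N} \xi_k = X_N - X_0 = \| D u - D \tilde{u} \|_2^2 - \mathbb{E}\bigl[ \| D u - D \tilde{u} \|_2^2 \bigr] $$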

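and the substitution of the last equality into (2.207) gives that

$$ \overline{P}_e \le \exp(NR) \, \exp\biggl( -\frac{\rho \, \mathbb{E}\bigl[ \| D u - D \tilde{u} \|_2^2 \bigr]}{8 \sigma_\nu^2} \biggr) \, \mathbb{E}\Biggl[ \exp\biggl( -\frac{\rho}{8 \sigma_\nu^2} \sum_{k=1}^{N} \xi_k \biggr) \Biggr]. \tag{2.211} $$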






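Since the codewords are independent and their symbols are i.i.d., it follows that

$$
\begin{aligned}
\mathbb{E} \| D u - D \tilde{u} \|_2^2
&= \sum_{k=1}^{N} \mathbb{E}\Bigl[ \bigl( [D u]_k - [D \tilde{u}]_k \bigr)^2 \Bigr] \\
&= \sum_{k=1}^{N} \mathrm{Var}\bigl( [D u]_k - [D \tilde{u}]_k \bigr) \\
&= 2 \sum_{k=1}^{N} \mathrm{Var}\bigl( [D u]_k \bigr) \\
&= 2 \Biggl( \sum_{k=1}^{q-1} \mathrm{Var}\bigl( [D u]_k \bigr) + \sum_{k=q}^{N} \mathrm{Var}\bigl( [D u]_k \bigr) \Biggr).
\end{aligned}
$$

Due to the channel model (see Eq. (2.228)) and the assumption that the symbols $\{u_i\}$ are i.i.d., it follows
that $\mathrm{Var}([D u]_k)$ is fixed for $k = q, \ldots, N$. Let $D_v(P)$ designate this common value of the variance (i.e.,
$D_v(P) = \mathrm{Var}([D u]_k)$ for $k \ge q$); then

$$ \mathbb{E} \| D u - D \tilde{u} \|_2^2 = 2 \Biggl( \sum_{k=1}^{q-1} \mathrm{Var}\bigl( [D u]_k \bigr) + (N - q + 1) D_v(P) \Biggr). $$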



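Let

$$ C_\rho(P) \triangleq \exp\Biggl\{ -\frac{\rho}{4 \sigma_\nu^2} \Biggl[ \sum_{k=1}^{q-1} \mathrm{Var}\bigl( [D u]_k \bigr) - (q - 1) D_v(P) \Biggr] \Biggr\} $$

which is a bounded constant, under the assumption that $\|u\|_\infty \le K < +\infty$ holds a.s. for some $K > 0$,
and it is independent of the block length $N$. This therefore implies that the ML decoding error probability
satisfies

$$ \overline{P}_e \le C_\rho(P) \exp\Biggl\{ -N \biggl[ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - R \biggr] \Biggr\} \, \mathbb{E}\Biggl[ \exp\biggl( \frac{\rho}{8 \sigma_\nu^2} \sum_{k=1}^{N} Z_k \biggr) \Biggr], \quad \forall \rho \in [0,1] \tag{2.212} $$

where $Z_k \triangleq -\xi_k$, so $\{Z_k, \mathcal{F}_k\}$ is a martingale-difference sequence that corresponds to the jumps of the martingale
$\{-X_k, \mathcal{F}_k\}$. From (2.209), it follows that the martingale-difference sequence $\{Z_k, \mathcal{F}_k\}$ is given by

$$ Z_k = X_{k-1} - X_k = \mathbb{E}\bigl[ \| D u - D \tilde{u} \|_2^2 \mid \mathcal{F}_{k-1} \bigr] - \mathbb{E}\bigl[ \| D u - D \tilde{u} \|_2^2 \mid \mathcal{F}_k \bigr]. \tag{2.213} $$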


For the derivation of improved achievable rates and error exponents (as compared to [2]), the two
martingale inequalities presented earlier in this subsection are applied to obtain two possible
exponential upper bounds (in terms of $N$) on the last term on the right-hand side of (2.212).

Let us assume that the essential supremum of the channel input is finite a.s. (i.e., $\|u\|_\infty$ is bounded
a.s.). Based on the upper bound on the ML decoding error probability in (2.212), combined with the
exponential martingale inequalities that are introduced in Theorems 19 and 20, one obtains the following
bounds:

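1. First Bounding Technique: From Theorem 19, if

$$ Z_k \le d, \qquad \mathrm{Var}(Z_k \mid \mathcal{F}_{k-1}) \le \sigma^2 $$

holds a.s. for every $k \ge 1$, and $\gamma_2 \triangleq \frac{\sigma^2}{d^2}$, then it follows from (2.212) that for every $\rho \in [0,1]$

$$ \overline{P}_e \le C_\rho(P) \exp\Biggl\{ -N \Biggl[ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - R - \ln\Biggl( \frac{\exp\bigl( -\frac{\gamma_2 \rho d}{8 \sigma_\nu^2} \bigr) + \gamma_2 \exp\bigl( \frac{\rho d}{8 \sigma_\nu^2} \bigr)}{1 + \gamma_2} \Biggr) \Biggr] \Biggr\}. $$

Therefore, the maximal achievable rate that follows from this bound is given by

$$ R_1(\sigma_\nu^2) \triangleq \max_{P} \max_{\rho \in [0,1]} \Biggl\{ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - \ln\Biggl( \frac{\exp\bigl( -\frac{\gamma_2 \rho d}{8 \sigma_\nu^2} \bigr) + \gamma_2 \exp\bigl( \frac{\rho d}{8 \sigma_\nu^2} \bigr)}{1 + \gamma_2} \Biggr) \Biggr\} \tag{2.214} $$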



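where the double maximization is performed over the input distribution $P$ and the parameter $\rho \in [0,1]$.
The inner maximization in (2.214) can be expressed in closed form, leading to the following simplified
expression:

$$
R_1(\sigma_\nu^2) = \max_{P}
\begin{cases}
D\biggl( \dfrac{\gamma_2}{1 + \gamma_2} + \dfrac{2 D_v(P)}{d (1 + \gamma_2)} \,\bigg\|\, \dfrac{\gamma_2}{1 + \gamma_2} \biggr),
& \text{if } D_v(P) < \dfrac{\gamma_2 d}{2} \cdot \dfrac{\exp\bigl( \frac{d (1 + \gamma_2)}{8 \sigma_\nu^2} \bigr) - 1}{1 + \gamma_2 \exp\bigl( \frac{d (1 + \gamma_2)}{8 \sigma_\nu^2} \bigr)} \\[3ex]
\dfrac{D_v(P)}{4 \sigma_\nu^2} - \ln\Biggl( \dfrac{\exp\bigl( -\frac{\gamma_2 d}{8 \sigma_\nu^2} \bigr) + \gamma_2 \exp\bigl( \frac{d}{8 \sigma_\nu^2} \bigr)}{1 + \gamma_2} \Biggr),
& \text{otherwise}
\end{cases}
\tag{2.215}
$$

where

$$ D(p \| q) \triangleq p \ln\Bigl( \frac{p}{q} \Bigr) + (1 - p) \ln\Bigl( \frac{1 - p}{1 - q} \Bigr), \quad \forall p, q \in (0,1) \tag{2.216} $$

denotes the Kullback-Leibler distance (a.k.a. divergence or relative entropy) between the two probability
distributions $(p, 1 - p)$ and $(q, 1 - q)$.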
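The closed-form expression can be checked against a brute-force maximization over $\rho$. The Python sketch below does this for a fixed input distribution, assuming that the objective of (2.214) is $\frac{\rho D_v}{4\sigma_\nu^2} - \ln\bigl(\frac{\exp(-\gamma_2 \rho d/(8\sigma_\nu^2)) + \gamma_2 \exp(\rho d/(8\sigma_\nu^2))}{1+\gamma_2}\bigr)$ and that the closed form is as in (2.215); the function names, the parameter values and the grid size are illustrative choices.

```python
import math


def kl_binary(p, q):
    """Binary Kullback-Leibler divergence D(p || q) in nats, for q in (0, 1)."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def r1_objective(rho, dv, d, gamma2, noise_var):
    """Assumed objective of (2.214) for a fixed input distribution."""
    x = rho * d / (8.0 * noise_var)
    log_term = math.log((math.exp(-gamma2 * x) + gamma2 * math.exp(x)) / (1.0 + gamma2))
    return rho * dv / (4.0 * noise_var) - log_term


def r1_closed_form(dv, d, gamma2, noise_var):
    """Assumed closed form (2.215) of the maximum over rho in [0, 1]."""
    e = math.exp(d * (1.0 + gamma2) / (8.0 * noise_var))
    threshold = 0.5 * gamma2 * d * (e - 1.0) / (1.0 + gamma2 * e)
    if dv < threshold:
        q = gamma2 / (1.0 + gamma2)
        return kl_binary(q + 2.0 * dv / (d * (1.0 + gamma2)), q)
    return r1_objective(1.0, dv, d, gamma2, noise_var)


def r1_brute_force(dv, d, gamma2, noise_var, steps=20_000):
    """Grid search over rho in [0, 1], used only as a numerical cross-check."""
    return max(r1_objective(k / steps, dv, d, gamma2, noise_var) for k in range(steps + 1))


if __name__ == "__main__":
    # Hypothetical parameter values (D_v, d, gamma_2, noise variance).
    for params in [(0.5, 2.0, 1.0, 0.1), (0.1, 1.0, 0.3, 0.05), (1.0, 2.0, 1.0, 1.0)]:
        print(params, r1_closed_form(*params), r1_brute_force(*params))
```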



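2. Second Bounding Technique: Based on the combination of Theorem 20 and Eq. (2.212), we derive in the
following a second achievable rate for random coding under ML decoding. Referring to the martingale-difference
sequence $\{Z_k, \mathcal{F}_k\}_{k=1}^{N}$ in Eqs. (2.208) and (2.213), one obtains from Eq. (2.212) that if for some
even number $m \in \mathbb{N}$

$$ Z_k \le d, \qquad \mathbb{E}\bigl[ (Z_k)^l \mid \mathcal{F}_{k-1} \bigr] \le \mu_l, \quad \forall l \in \{2, \ldots, m\} $$

hold a.s. for some positive constant $d > 0$ and a sequence $\{\mu_l\}_{l=2}^{m}$, and

$$ \gamma_l \triangleq \frac{\mu_l}{d^l}, \quad \forall l \in \{2, \ldots, m\}, $$

then the average error probability satisfies, for every $\rho \in [0,1]$

$$ \overline{P}_e \le C_\rho(P) \exp\Biggl\{ -N \Biggl[ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - R - \ln\Biggl( 1 + \sum_{l=2}^{m-1} \frac{\gamma_l - \gamma_m}{l!} \Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr)^{l} + \gamma_m \biggl( \exp\Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \biggr) \Biggr) \Biggr] \Biggr\}. $$

This gives the following achievable rate, for an arbitrary even number $m \in \mathbb{N}$:

$$ R_2(\sigma_\nu^2) \triangleq \max_{P} \max_{\rho \in [0,1]} \Biggl\{ \frac{\rho D_v(P)}{4 \sigma_\nu^2} - \ln\Biggl( 1 + \sum_{l=2}^{m-1} \frac{\gamma_l - \gamma_m}{l!} \Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr)^{l} + \gamma_m \biggl( \exp\Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \biggr) \Biggr) \Biggr\} \tag{2.217} $$

where, similarly to (2.214), the double maximization in (2.217) is performed over the input distribution
$P$ and the parameter $\rho \in [0,1]$.

Achievable rates for random coding

In the following, the achievable rates for random coding over various linear and non-linear channels (with
and without memory) are exemplified. In order to assess the tightness of the bounds, we start with a
simple example where the mutual information for the given input distribution is known, so that its gap
can be estimated (since we use here the union bound, it would also have been appropriate to compare the
achievable rate with the cutoff rate).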

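1. Binary-Input AWGN Channel: Consider the case of a binary-input AWGN channel where

$$ Y_k = U_k + \nu_k $$

where $U_i = \pm A$ for some constant $A > 0$ is a binary input, and $\nu_i \sim \mathcal{N}(0, \sigma_\nu^2)$ is an additive Gaussian
noise with zero mean and variance $\sigma_\nu^2$. Since the codewords $U = (U_1, \ldots, U_N)$ and $\tilde{U} = (\tilde{U}_1, \ldots, \tilde{U}_N)$ are
independent and their symbols are i.i.d., let

$$ \mathbb{P}(U_k = A) = \mathbb{P}(\tilde{U}_k = A) = \alpha, \qquad \mathbb{P}(U_k = -A) = \mathbb{P}(\tilde{U}_k = -A) = 1 - \alpha $$

for some $\alpha \in [0,1]$. Since the channel is memoryless and all the symbols are i.i.d., one gets from
(2.208) and (2.213) that

$$
\begin{aligned}
Z_k &= \mathbb{E}\bigl[ \| U - \tilde{U} \|_2^2 \mid \mathcal{F}_{k-1} \bigr] - \mathbb{E}\bigl[ \| U - \tilde{U} \|_2^2 \mid \mathcal{F}_k \bigr] \\
&= \Biggl\{ \sum_{j=1}^{k-1} (U_j - \tilde{U}_j)^2 + \sum_{j=k}^{N} \mathbb{E}\bigl[ (U_j - \tilde{U}_j)^2 \bigr] \Biggr\}
 - \Biggl\{ \sum_{j=1}^{k} (U_j - \tilde{U}_j)^2 + \sum_{j=k+1}^{N} \mathbb{E}\bigl[ (U_j - \tilde{U}_j)^2 \bigr] \Biggr\} \\
&= \mathbb{E}\bigl[ (U_k - \tilde{U}_k)^2 \bigr] - (U_k - \tilde{U}_k)^2 \\
&= \alpha (1 - \alpha) (-2A)^2 + \alpha (1 - \alpha) (2A)^2 - (U_k - \tilde{U}_k)^2 \\
&= 8 \alpha (1 - \alpha) A^2 - (U_k - \tilde{U}_k)^2.
\end{aligned}
$$

Hence, for every $k$,

$$ Z_k \le 8 \alpha (1 - \alpha) A^2 \triangleq d. \tag{2.218} $$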

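Furthermore, for every $k, l \in \mathbb{N}$, due to the above properties

$$
\begin{aligned}
\mathbb{E}\bigl[ (Z_k)^l \mid \mathcal{F}_{k-1} \bigr]
&= \mathbb{E}\bigl[ (Z_k)^l \bigr] \\
&= \mathbb{E}\Bigl[ \bigl( 8 \alpha (1 - \alpha) A^2 - (U_k - \tilde{U}_k)^2 \bigr)^l \Bigr] \\
&= \bigl[ 1 - 2 \alpha (1 - \alpha) \bigr] \bigl( 8 \alpha (1 - \alpha) A^2 \bigr)^l + 2 \alpha (1 - \alpha) \bigl( 8 \alpha (1 - \alpha) A^2 - 4 A^2 \bigr)^l \triangleq \mu_l
\end{aligned}
\tag{2.219}
$$

and therefore, from (2.218) and (2.219), for every $l \in \mathbb{N}$

$$ \gamma_l \triangleq \frac{\mu_l}{d^l} = \bigl[ 1 - 2 \alpha (1 - \alpha) \bigr] \Biggl[ 1 + (-1)^l \biggl( \frac{1 - 2 \alpha (1 - \alpha)}{2 \alpha (1 - \alpha)} \biggr)^{l-1} \Biggr]. \tag{2.220} $$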
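The passage from the moments $\mu_l$ to the closed form for $\gamma_l$ can be verified numerically. The Python sketch below assumes $d = 8\alpha(1-\alpha)A^2$, $\mu_l = [1 - 2\alpha(1-\alpha)]\, d^l + 2\alpha(1-\alpha)(d - 4A^2)^l$ and the closed form $\gamma_l = [1 - 2\alpha(1-\alpha)]\bigl[1 + (-1)^l \bigl(\frac{1 - 2\alpha(1-\alpha)}{2\alpha(1-\alpha)}\bigr)^{l-1}\bigr]$; the function names are illustrative.

```python
def gamma_direct(l, alpha, A=1.0):
    """gamma_l = mu_l / d^l computed directly from the two-point law of
    (U_k - U~_k)^2, which equals 4 A^2 with probability 2 alpha (1 - alpha)
    and 0 otherwise."""
    a = alpha * (1.0 - alpha)
    d = 8.0 * a * A ** 2
    mu = (1.0 - 2.0 * a) * d ** l + 2.0 * a * (d - 4.0 * A ** 2) ** l
    return mu / d ** l


def gamma_closed(l, alpha):
    """Assumed closed form of gamma_l, as in (2.220)."""
    a = alpha * (1.0 - alpha)
    ratio = (1.0 - 2.0 * a) / (2.0 * a)
    return (1.0 - 2.0 * a) * (1.0 + (-1) ** l * ratio ** (l - 1))


if __name__ == "__main__":
    for alpha in (0.1, 0.3, 0.5):
        print(alpha, [round(gamma_closed(l, alpha), 6) for l in range(1, 6)])
```

For the symmetric input ($\alpha = \frac{1}{2}$), the sequence alternates between 0 (odd $l$) and 1 (even $l$), and $\gamma_1 = 0$ for every $\alpha$ since $Z_k$ has zero mean.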



Let us now rely on the two achievable rates for random coding in Eqs. (2.215) and (2.217), and apply them
to the binary-input AWGN channel. Due to the channel symmetry, the considered input distribution is
symmetric (i.e., $\alpha = \frac{1}{2}$ and $P = (\frac{1}{2}, \frac{1}{2})$). In this case, we obtain from (2.218) and (2.220) that

$$ D_v(P) = \mathrm{Var}(U_k) = A^2, \qquad d = 2 A^2, \qquad \gamma_l = \frac{1 + (-1)^l}{2}, \quad \forall l \in \mathbb{N}. \tag{2.221} $$

Based on the first bounding technique that leads to the achievable rate in Eq. (2.215), since the first
condition in this equation cannot hold for the set of parameters in (2.221), the achievable rate in
this equation is equal to

$$ R_1(\sigma_\nu^2) = \frac{A^2}{4 \sigma_\nu^2} - \ln \cosh\Bigl( \frac{A^2}{4 \sigma_\nu^2} \Bigr) $$

in units of nats per channel use. Let $\mathrm{SNR} \triangleq \frac{A^2}{\sigma_\nu^2}$ designate the signal to noise ratio; then the first achievable
rate gets the form

$$ R'_1(\mathrm{SNR}) = \frac{\mathrm{SNR}}{4} - \ln \cosh\Bigl( \frac{\mathrm{SNR}}{4} \Bigr). \tag{2.222} $$

It is observed here that the optimal value of $\rho$ in (2.215) is equal to 1 (i.e., $\rho^\star = 1$).

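Let us compare it in the following with the achievable rate that follows from (2.217). Let $m \in \mathbb{N}$ be an
even number. Since, from (2.221), $\gamma_l = 1$ for all even values of $l \in \mathbb{N}$ and $\gamma_l = 0$ for all odd values of
$l \in \mathbb{N}$, then

$$
\begin{aligned}
& \sum_{l=2}^{m-1} \frac{\gamma_l - \gamma_m}{l!} \Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr)^{l} + \gamma_m \biggl( \exp\Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \biggr) \\
& = -\sum_{l=1}^{\frac{m}{2} - 1} \frac{1}{(2l+1)!} \Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr)^{2l+1} + \biggl( \exp\Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \biggr).
\end{aligned}
\tag{2.223}
$$

Since the sum $\sum_{l=1}^{\frac{m}{2} - 1} \frac{1}{(2l+1)!} \bigl( \frac{\rho d}{8 \sigma_\nu^2} \bigr)^{2l+1}$ is monotonically increasing with $m$ (where $m$ is even and
$\rho \in [0,1]$), then from (2.217), the best achievable rate within this form is obtained in the limit where $m$
is even and $m \to \infty$. In this asymptotic case one gets

$$
\begin{aligned}
& \lim_{m \to \infty} \Biggl( 1 + \sum_{l=2}^{m-1} \frac{\gamma_l - \gamma_m}{l!} \Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr)^{l} + \gamma_m \biggl( \exp\Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \biggr) \Biggr) \\
& \overset{(a)}{=} 1 - \sum_{l=1}^{\infty} \frac{1}{(2l+1)!} \Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr)^{2l+1} + \biggl( \exp\Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \biggr) \\
& \overset{(b)}{=} 1 - \biggl( \sinh\Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr) - \frac{\rho d}{8 \sigma_\nu^2} \biggr) + \biggl( \exp\Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr) - 1 - \frac{\rho d}{8 \sigma_\nu^2} \biggr) \\
& \overset{(c)}{=} \cosh\Bigl( \frac{\rho d}{8 \sigma_\nu^2} \Bigr)
\end{aligned}
\tag{2.224}
$$

where equality (a) follows from (2.223), equality (b) holds since $\sinh(x) = \sum_{l=0}^{\infty} \frac{x^{2l+1}}{(2l+1)!}$ for $x \in \mathbb{R}$, and
equality (c) holds since $\sinh(x) + \cosh(x) = \exp(x)$. Therefore, the achievable rate in (2.217) gives (from
(2.221), $\frac{d}{8 \sigma_\nu^2} = \frac{A^2}{4 \sigma_\nu^2}$)

$$ R_2(\sigma_\nu^2) = \max_{\rho \in [0,1]} \biggl( \frac{\rho A^2}{4 \sigma_\nu^2} - \ln \cosh\Bigl( \frac{\rho A^2}{4 \sigma_\nu^2} \Bigr) \biggr). $$

Since the function $f(x) \triangleq x - \ln \cosh(x)$ for $x \in \mathbb{R}$ is monotonically increasing (note that $f'(x) = 1 - \tanh(x) > 0$),
the optimal value of $\rho \in [0,1]$ is equal to 1, and therefore the best achievable rate that follows
from the second bounding technique in Eq. (2.217) is equal to

$$ R_2(\sigma_\nu^2) = \frac{A^2}{4 \sigma_\nu^2} - \ln \cosh\Bigl( \frac{A^2}{4 \sigma_\nu^2} \Bigr) $$

in units of nats per channel use, and it is obtained in the asymptotic case where we let the even number
$m$ tend to infinity. Finally, setting $\mathrm{SNR} = \frac{A^2}{\sigma_\nu^2}$ gives the achievable rate in (2.222), so the first and second
achievable rates for the binary-input AWGN channel coincide, i.e.,

$$ R'_1(\mathrm{SNR}) = R'_2(\mathrm{SNR}) = \frac{\mathrm{SNR}}{4} - \ln \cosh\Bigl( \frac{\mathrm{SNR}}{4} \Bigr). \tag{2.225} $$

Note that this common rate tends to zero as we let the signal to noise ratio tend to zero, and it tends to
$\ln 2$ nats per channel use (i.e., 1 bit per channel use) as we let the signal to noise ratio tend to infinity.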
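The behavior of the common rate, and the way the second bounding technique approaches it as $m$ grows, can be examined numerically. The Python sketch below assumes the common rate $\frac{\mathrm{SNR}}{4} - \ln\cosh\bigl(\frac{\mathrm{SNR}}{4}\bigr)$ and, for a finite even $m$ with $\rho = 1$, the rate $x - \ln\bigl(e^{x} - x - \sum_{\text{odd } l = 3}^{m-1} x^l / l!\bigr)$ with $x = \mathrm{SNR}/4$ (obtained by substituting $\gamma_l \in \{0, 1\}$ into the second bound); the function names are illustrative. The identity $x - \ln\cosh(x) = \ln 2 - \ln(1 + e^{-2x})$ is used to avoid overflow at high SNR.

```python
import math


def common_rate(snr):
    """SNR/4 - ln cosh(SNR/4) in nats per channel use, for snr >= 0,
    computed as ln 2 - ln(1 + exp(-2x)) with x = SNR/4."""
    x = snr / 4.0
    return math.log(2.0) - math.log1p(math.exp(-2.0 * x))


def second_rate_biawgn(snr, m):
    """Rate of the second bounding technique for a finite even m and rho = 1,
    for the symmetric binary input (gamma_l = 1 for even l, 0 for odd l)."""
    x = snr / 4.0
    odd_terms = sum(x ** l / math.factorial(l) for l in range(3, m, 2))
    return x - math.log(math.exp(x) - x - odd_terms)


if __name__ == "__main__":
    for snr in (0.0, 1.0, 4.0, 16.0, 64.0):
        print(snr, common_rate(snr))
    print([second_rate_biawgn(4.0, m) for m in (2, 4, 6, 40)])
```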

In the considered setting of random coding, in order to exemplify the tightness of the achievable rate in
(2.225), it is compared in the following with the symmetric i.i.d. mutual information of the binary-input
AWGN channel. The mutual information for this channel (in units of nats per channel use) is given by
(see, e.g., [HI Example 4.38 on p. 194])

$$
\begin{aligned}
C(\mathrm{SNR}) = {} & \ln 2 + (2\, \mathrm{SNR} - 1)\, Q\bigl( \sqrt{\mathrm{SNR}} \bigr) - \sqrt{\frac{2\, \mathrm{SNR}}{\pi}} \exp\Bigl( -\frac{\mathrm{SNR}}{2} \Bigr) \\
& + \sum_{i=1}^{\infty} \biggl\{ \frac{(-1)^i}{i (i+1)} \cdot \exp\bigl( 2 i (i+1)\, \mathrm{SNR} \bigr)\, Q\bigl( (1 + 2i) \sqrt{\mathrm{SNR}} \bigr) \biggr\}
\end{aligned}
\tag{2.226}
$$

where the Q-function that appears in the infinite series on the right-hand side of (2.226) is the complementary
Gaussian cumulative distribution function in (2.11). Furthermore, this infinite series converges fast:
the absolute value of its $n$-th remainder is bounded by the $(n+1)$-th term of the
series, which scales like $\frac{1}{n^3}$ (due to a basic theorem on infinite series of the form $\sum_{n \in \mathbb{N}} (-1)^n a_n$ where
$\{a_n\}$ is a positive and monotonically decreasing sequence; the theorem states that the $n$-th remainder of
the series is upper bounded in absolute value by $a_{n+1}$).

The comparison between the mutual information of the binary-input AWGN channel with a symmetric
i.i.d. input distribution and the common achievable rate in (2.225) that follows from the martingale
approach is shown in Figure 2.3.

Figure 2.3: A comparison between the symmetric i.i.d. mutual information of the binary-input AWGN
channel (solid line) and the common achievable rate in (2.225) (dashed line) that follows from the
martingale approach in this subsection.

From the discussion in this subsection, the first and second bounding techniques in Section 2.6.8 lead
to the same achievable rate (see (2.225)) in the setup of random coding and ML decoding where we
assume a symmetric input distribution (i.e., $P(\pm A) = \frac{1}{2}$). But this is due to the fact that, from (2.221),
the sequence $\{\gamma_l\}_{l \ge 2}$ is equal to zero for odd indices $l$ and it is equal to 1 for even values of $l$ (see
the derivation of (2.223) and (2.224)). Note, however, that the second bounding technique may provide
tighter bounds than the first one (which follows from Bennett's inequality) due to the knowledge of $\{\gamma_l\}$
for $l > 2$.

2. Nonlinear Channels with Memory - Third-Order Volterra Channels: The channel model is first presented
in the following (see Figure 2.4). We refer in the following to a discrete-time channel model of nonlinear
Volterra channels where the input-output channel model is given by

$$ y_i = [D u]_i + \nu_i \tag{2.227} $$

where $i$ is the time index. Volterra's operator $D$ of order $L$ and memory $q$ is given by

$$ [D u]_i = h_0 + \sum_{j=1}^{L} \sum_{i_1=0}^{q} \cdots \sum_{i_j=0}^{q} h_j(i_1, \ldots, i_j) \, u_{i-i_1} \cdots u_{i-i_j} \tag{2.228} $$

and $\nu$ is an additive Gaussian noise vector with i.i.d. entries $\nu_i \sim \mathcal{N}(0, \sigma_\nu^2)$.
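The noiseless part of this channel model can be sketched in a few lines of Python. The representation below is an assumption made for this illustration: kernels are stored in a dictionary that maps a tuple of delays $(i_1, \ldots, i_j)$ to the coefficient $h_j(i_1, \ldots, i_j)$, delay tuples that are not listed contribute zero, and input samples before time 0 are taken to be zero. The kernel values in the usage example are hypothetical.

```python
def volterra_output(u, kernels, h0=0.0):
    """Noiseless Volterra operator [Du]_i for i = 0, ..., len(u) - 1.

    kernels maps a delay tuple (i_1, ..., i_j) to h_j(i_1, ..., i_j); each entry
    contributes h_j(i_1, ..., i_j) * u[i - i_1] * ... * u[i - i_j].
    Samples with a negative time index are assumed to be zero."""
    out = []
    for i in range(len(u)):
        y = h0
        for delays, coeff in kernels.items():
            term = coeff
            for delay in delays:
                term *= u[i - delay] if i - delay >= 0 else 0.0
            y += term
        out.append(y)
    return out


if __name__ == "__main__":
    # Hypothetical kernels of a system of order L = 3 and memory q = 2.
    kernels = {(0,): 1.0, (1, 1): -0.3, (0, 1, 2): 0.6}
    print(volterra_output([1.0, -1.0, 1.0], kernels))
```

A noisy channel output as in the model above would be obtained by adding independent Gaussian samples of variance $\sigma_\nu^2$ to each entry of the returned list.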

Figure 2.4: The discrete-time Volterra non-linear channel model in Eqs. (2.227) and (2.228) where the
channel input and output are $\{U_i\}$ and $\{Y_i\}$, respectively, and the additive noise samples $\{\nu_i\}$, which are
added to the distorted input, are i.i.d. with zero mean and variance $\sigma_\nu^2$. (Block diagram: the input passes
through the Volterra operator $D$, and the Gaussian noise $\nu$ is added to its output.)

Under the same setup of the previous subsection regarding the channel input characteristics, we consider
next the transmission of information over the Volterra system $D_1$ of order $L = 3$ and memory $q = 2$,
whose kernels are depicted in Table 2.1. Such system models are used in the base-band representation of
nonlinear narrow-band communication channels. Due to the complexity of the channel model, the calculation
of the achievable rates provided earlier in this subsection requires the numerical calculation of the
parameters $d$ and $\sigma^2$, and thus of $\gamma_2$, for the martingale $\{Z_i, \mathcal{F}_i\}$. In order to achieve this goal, we have to
calculate $|Z_i - Z_{i-1}|$ and $\mathrm{Var}(Z_i \mid \mathcal{F}_{i-1})$ for all possible combinations of the input samples which contribute
to the aforementioned expressions. Thus, the complexity of the analytic calculation of $d$ and $\gamma_l$ increases as the system's
memory $q$ increases. Numerical results are provided in Figure 2.5 for the case where $\sigma_\nu^2 = 1$. The new
achievable rates $R_1(D_1, A, \sigma_\nu^2)$ and $R_2(D_1, A, \sigma_\nu^2)$, which depend on the channel input parameter $A$, are
compared to the achievable rate provided in [41, Fig. 2] and are shown to be larger than the latter.

To conclude, improvements of the achievable rates in the low SNR regime are expected to be obtained
via existing improvements to Bennett's inequality (see [117] and [118]), combined with a possible
tightening of the union bound under ML decoding (see, e.g., [119]).

Table 2.1: Kernels of the 3rd order Volterra system $D_1$ with memory 2

kernel   $h_1(0)$   $h_1(1)$   $h_1(2)$   $h_2(0,0)$   $h_2(1,1)$   $h_2(0,1)$
value    1.0        0.5        -0.8       1.0          -0.3         0.6

kernel   $h_3(0,0,0)$   $h_3(1,1,1)$   $h_3(0,0,1)$   $h_3(0,1,1)$   $h_3(0,1,2)$
value    1.0            -0.5           1.2            0.8            0.6

[Figure 2.5 (plot): achievable rates versus the channel input parameter $A$, for $A$ between 0.2 and 2.]

Figure 2.5: Comparison of the achievable rates in this subsection, $R_1(D_1, A, \sigma_\nu^2)$ and $R_2(D_1, A, \sigma_\nu^2)$ (where $m = 2$), with the bound $R_p(D_1, A, \sigma_\nu^2)$ of [41, Fig. 2] for the nonlinear channel with kernels depicted in Table 2.1 and noise variance $\sigma_\nu^2 = 1$. Rates are expressed in nats per channel use.

2.7 Summary

This chapter derives some classical concentration inequalities for discrete-parameter martingales with uniformly bounded jumps, and it considers some of their applications in information theory and related topics. The first part is focused on the derivation of these refined inequalities, followed by a discussion on their relations to some classical results in probability theory. Along this discussion, these inequalities are linked to the method of types, martingale central limit theorem, law of iterated logarithm, moderate deviations principle, and to some reported concentration inequalities from the literature. The second part of this work exemplifies these martingale inequalities in the context of hypothesis testing and information theory, communication, and coding theory. The interconnections between the concentration inequalities that are analyzed in the first part of this work (including some geometric interpretation w.r.t. some of these inequalities) are studied, and the conclusions of this study serve for the discussion on information-theoretic aspects related to these concentration inequalities in the second part of this chapter. A recent interesting avenue that follows from the martingale-based inequalities that are introduced in this chapter is their generalization to random matrices (see, e.g., [15] and [16]).

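2.A Proof of Proposition 1

We prove in the following that Theorem 5 implies (2.74). Let $\{X_k, \mathcal{F}_k\}_{k=0}^{\infty}$ be a discrete-parameter martingale that satisfies the conditions in Theorem 5. From (2.34),

\[ P\bigl(|X_n - X_0| \ge \alpha\sqrt{n}\bigr) \le 2 \exp\left( -n\, D\!\left( \frac{\delta' + \gamma}{1+\gamma} \,\Big\|\, \frac{\gamma}{1+\gamma} \right) \right) \tag{2.229} \]

where, from (2.35),

\[ \delta' \triangleq \frac{\alpha/\sqrt{n}}{d} = \frac{\delta}{\sqrt{n}}. \tag{2.230} \]

From the right-hand side of (2.229),

\[ D\!\left( \frac{\delta' + \gamma}{1+\gamma} \,\Big\|\, \frac{\gamma}{1+\gamma} \right) = \frac{\gamma}{1+\gamma} \left[ \left(1 + \frac{\delta}{\gamma\sqrt{n}}\right) \ln\!\left(1 + \frac{\delta}{\gamma\sqrt{n}}\right) + \frac{1}{\gamma} \left(1 - \frac{\delta}{\sqrt{n}}\right) \ln\!\left(1 - \frac{\delta}{\sqrt{n}}\right) \right]. \tag{2.231} \]

From the equality

\[ (1+u)\ln(1+u) = u + \sum_{k=2}^{\infty} \frac{(-u)^k}{k(k-1)}, \qquad -1 < u \le 1, \]

it follows from (2.231) that for every $n > \delta^2/\gamma^2$

\[ n\, D\!\left( \frac{\delta' + \gamma}{1+\gamma} \,\Big\|\, \frac{\gamma}{1+\gamma} \right) = \frac{\delta^2}{2\gamma} - \frac{\delta^3 (1-\gamma)}{6\gamma^2} \cdot \frac{1}{\sqrt{n}} + \ldots = \frac{\delta^2}{2\gamma} + O\!\left(\frac{1}{\sqrt{n}}\right). \]

Substituting this into the exponent on the right-hand side of (2.229) gives (2.74).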
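
The convergence of $n\,D(\cdot\|\cdot)$ to $\delta^2/(2\gamma)$ can be checked numerically. The following Python sketch is only an illustration under assumed sample values ($\delta = 0.5$, $\gamma = 0.25$); the helper names `binary_kl` and `scaled_exponent` are hypothetical and do not appear in the text.

```python
import math


def binary_kl(p, q):
    """Relative entropy D(p||q), in nats, between Bernoulli(p) and Bernoulli(q)."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def scaled_exponent(n, delta, gamma):
    """n * D((delta' + gamma)/(1 + gamma) || gamma/(1 + gamma)), delta' = delta/sqrt(n)."""
    d_prime = delta / math.sqrt(n)
    return n * binary_kl((d_prime + gamma) / (1 + gamma), gamma / (1 + gamma))


if __name__ == "__main__":
    delta, gamma = 0.5, 0.25          # assumed sample values
    limit = delta ** 2 / (2 * gamma)  # leading term of the expansion
    for n in (100, 10_000, 1_000_000):
        print(n, scaled_exponent(n, delta, gamma), limit)
```

For these sample values the printed exponent approaches the limit from below as $n$ grows, in line with the negative sign of the $1/\sqrt{n}$ term when $\gamma < 1$.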



2.B Analysis related to the moderate deviations principle in Sec- 
tion [2X3] 

It is demonstrated in the following that, in contrast to Azuma's inequality, Theorem 5 provides an upper bound on

\[ P\left( \Bigl| \sum_{i=1}^{n} X_i \Bigr| \ge \alpha n^{\eta} \right), \qquad \forall\, \alpha \ge 0 \]

which coincides with the exact asymptotic limit in (2.108). It is proved under the further assumption that there exists some constant $d > 0$ such that $|X_k| \le d$ a.s. for every $k \in \mathbb{N}$. Let us define the martingale sequence $\{S_k, \mathcal{F}_k\}_{k=0}^{n}$ where

\[ S_k \triangleq \sum_{i=1}^{k} X_i, \qquad \mathcal{F}_k \triangleq \sigma(X_1, \ldots, X_k) \]

for every $k \in \{1, \ldots, n\}$ with $S_0 = 0$ and $\mathcal{F}_0 = \{\emptyset, \mathcal{F}\}$.
Analysis related to Azuma's inequality 

The martingale sequence $\{S_k, \mathcal{F}_k\}_{k=0}^{n}$ has uniformly bounded jumps, where $|S_k - S_{k-1}| = |X_k| \le d$ a.s. for every $k \in \{1, \ldots, n\}$. Hence it follows from Azuma's inequality that, for every $\alpha \ge 0$,

\[ P\bigl(|S_n| \ge \alpha n^{\eta}\bigr) \le 2 \exp\left( -\frac{\alpha^2 n^{2\eta - 1}}{2d^2} \right) \]

and therefore

\[ \lim_{n \to \infty} n^{1-2\eta} \ln P\bigl(|S_n| \ge \alpha n^{\eta}\bigr) \le -\frac{\alpha^2}{2d^2}. \tag{2.232} \]

This differs from the limit in (2.108), where $\sigma^2$ is replaced by $d^2$, so Azuma's inequality does not provide the asymptotic limit in (2.108) (unless $\sigma^2 = d^2$, i.e., $|X_k| = d$ a.s. for every $k$).
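Analysis related to Theorem 5

The analysis here is a slight modification of the analysis in Appendix 2.A with the required adaptation of the calculations for $\eta \in (\tfrac{1}{2}, 1)$. It follows from Theorem 5 that, for every $\alpha \ge 0$,

\[ P\bigl(|S_n| \ge \alpha n^{\eta}\bigr) \le 2 \exp\left( -n\, D\!\left( \frac{\delta' + \gamma}{1+\gamma} \,\Big\|\, \frac{\gamma}{1+\gamma} \right) \right) \]

where $\gamma$ is introduced in (2.35), and $\delta'$ in (2.230) is replaced with

\[ \delta' \triangleq \frac{\alpha / n^{1-\eta}}{d} = \delta\, n^{-(1-\eta)} \tag{2.233} \]

due to the definition of $\delta$ in (2.35). Following the same analysis as in Appendix 2.A, it follows that for every $n \in \mathbb{N}$

\[ P\bigl(|S_n| \ge \alpha n^{\eta}\bigr) \le 2 \exp\left( -\frac{\delta^2 n^{2\eta - 1}}{2\gamma} \left[ 1 - \frac{\alpha (1-\gamma)}{3\gamma d} \cdot n^{-(1-\eta)} + \ldots \right] \right) \]

and therefore (since, from (2.35), $\frac{\delta^2}{\gamma} = \frac{\alpha^2}{\sigma^2}$)

\[ \lim_{n \to \infty} n^{1-2\eta} \ln P\bigl(|S_n| \ge \alpha n^{\eta}\bigr) \le -\frac{\alpha^2}{2\sigma^2}. \tag{2.234} \]

Hence, this upper bound coincides with the exact asymptotic result in (2.108).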


2.C Proof of Proposition 2

The proof of (2.163) is based on calculus, and it is similar to the proof of the limit in (2.162) that relates the divergence and Fisher information. For the proof of (2.165), note that

\[ C(P_\theta, P_{\theta'}) \ge E_L(P_\theta, P_{\theta'}) \ge \min_{i=1,2} \left\{ \frac{\delta_i^2}{2\gamma_i} - \frac{\delta_i^3}{6\gamma_i^2 (1+\gamma_i)} \right\}. \tag{2.235} \]

The left-hand inequality in (2.235) holds since $E_L$ is a lower bound on the error exponent, and the exact value of this error exponent is the Chernoff information. The right-hand inequality in (2.235) follows from Lemma 7 (see (2.160)) and the definition of $E_L$ in (2.164). By definition, $\gamma_i \triangleq \frac{\sigma_i^2}{d_i^2}$ and $\delta_i \triangleq \frac{\varepsilon_i}{d_i}$ where, based on (2.150),

\[ \varepsilon_1 \triangleq D(P_\theta \| P_{\theta'}), \qquad \varepsilon_2 \triangleq D(P_{\theta'} \| P_\theta). \tag{2.236} \]

The term on the right-hand side of (2.235) therefore satisfies

5? 6? 



2 7i 67?(l+7i) 



2af 6o>? + d? 

>4fi-— 

" 2a} \ 3 



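so it follows from (2.235) and the last inequality that

\[ C(P_\theta, P_{\theta'}) \ge E_L(P_\theta, P_{\theta'}) \ge \min_{i=1,2} \left\{ \frac{\varepsilon_i^2}{2\sigma_i^2} \left( 1 - \frac{\varepsilon_i d_i}{3} \right) \right\}. \tag{2.237} \]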
Based on the continuity assumption of the indexed family $\{P_\theta\}_{\theta \in \Theta}$, it follows from (2.236) that

\[ \lim_{\theta' \to \theta} \varepsilon_i = 0, \qquad \forall\, i \in \{1, 2\} \]

and also, from (2.131) and (2.141) with $P_1$ and $P_2$ replaced by $P_\theta$ and $P_{\theta'}$ respectively,

\[ \lim_{\theta' \to \theta} d_i = 0, \qquad \forall\, i \in \{1, 2\}. \]

It therefore follows from (2.163) and (2.237) that

\[ \frac{J(\theta)}{8} \ge \lim_{\theta' \to \theta} \frac{E_L(P_\theta, P_{\theta'})}{(\theta - \theta')^2} \ge \lim_{\theta' \to \theta} \min_{i=1,2} \left\{ \frac{\varepsilon_i^2}{2\sigma_i^2 (\theta - \theta')^2} \right\}. \tag{2.238} \]

The idea is to show that the limit on the right-hand side of this inequality is $\frac{J(\theta)}{8}$ (same as the left-hand side), and hence, the limit of the middle term is also $\frac{J(\theta)}{8}$.

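\begin{align*}
\lim_{\theta' \to \theta} \frac{\varepsilon_1^2}{2\sigma_1^2 (\theta - \theta')^2}
&\overset{(a)}{=} \lim_{\theta' \to \theta} \frac{D(P_\theta \| P_{\theta'})^2}{2\sigma_1^2 (\theta - \theta')^2} \\
&\overset{(b)}{=} \frac{J(\theta)}{4} \lim_{\theta' \to \theta} \frac{D(P_\theta \| P_{\theta'})}{\sigma_1^2} \\
&\overset{(c)}{=} \frac{J(\theta)}{4} \lim_{\theta' \to \theta} \frac{D(P_\theta \| P_{\theta'})}{\sum_{x \in \mathcal{X}} P_\theta(x) \left( \ln \frac{P_\theta(x)}{P_{\theta'}(x)} - D(P_\theta \| P_{\theta'}) \right)^2} \\
&\overset{(d)}{=} \frac{J(\theta)}{4} \lim_{\theta' \to \theta} \frac{D(P_\theta \| P_{\theta'})}{\sum_{x \in \mathcal{X}} P_\theta(x) \left( \ln \frac{P_\theta(x)}{P_{\theta'}(x)} \right)^2 - D(P_\theta \| P_{\theta'})^2} \\
&\overset{(e)}{=} \frac{J(\theta)^2}{8} \lim_{\theta' \to \theta} \frac{(\theta - \theta')^2}{\sum_{x \in \mathcal{X}} P_\theta(x) \left( \ln \frac{P_\theta(x)}{P_{\theta'}(x)} \right)^2 - D(P_\theta \| P_{\theta'})^2} \\
&\overset{(f)}{=} \frac{J(\theta)^2}{8} \lim_{\theta' \to \theta} \frac{(\theta - \theta')^2}{\sum_{x \in \mathcal{X}} P_\theta(x) \left( \ln \frac{P_\theta(x)}{P_{\theta'}(x)} \right)^2} \\
&\overset{(g)}{=} \frac{J(\theta)}{8} \tag{2.239}
\end{align*}

where equality (a) follows from (2.236), equalities (b), (e) and (f) follow from (2.162), equality (c) follows from (2.132) with $P_1 = P_\theta$ and $P_2 = P_{\theta'}$, equality (d) follows from the definition of the divergence, and equality (g) follows by calculus (the required limit is calculated by using L'Hôpital's rule twice) and from the definition of Fisher information in (2.161). Similarly, also

\[ \lim_{\theta' \to \theta} \frac{\varepsilon_2^2}{2\sigma_2^2 (\theta - \theta')^2} = \frac{J(\theta)}{8} \]

so

\[ \lim_{\theta' \to \theta} \min_{i=1,2} \left\{ \frac{\varepsilon_i^2}{2\sigma_i^2 (\theta - \theta')^2} \right\} = \frac{J(\theta)}{8}. \]

Hence, it follows from (2.238) that $\lim_{\theta' \to \theta} \frac{E_L(P_\theta, P_{\theta'})}{(\theta - \theta')^2} = \frac{J(\theta)}{8}$. This completes the proof of (2.165).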






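We now prove equation (2.167). From (2.131), (2.141), (2.150) and (2.166),

\[ \widetilde{E}_L(P_\theta, P_{\theta'}) = \min_{i=1,2} \frac{\varepsilon_i^2}{2 d_i^2} \]

with $\varepsilon_1$ and $\varepsilon_2$ in (2.236). Hence,

\[ \lim_{\theta' \to \theta} \frac{\widetilde{E}_L(P_\theta, P_{\theta'})}{(\theta - \theta')^2} \le \lim_{\theta' \to \theta} \frac{\varepsilon_1^2}{2 d_1^2 (\theta - \theta')^2} \]

and from (2.239) and the last inequality, it follows that

\[ \lim_{\theta' \to \theta} \frac{\widetilde{E}_L(P_\theta, P_{\theta'})}{(\theta - \theta')^2} \le \frac{J(\theta)}{8} \lim_{\theta' \to \theta} \frac{\sigma_1^2}{d_1^2}. \tag{2.240} \]

It is clear that the second term on the right-hand side of (2.240) is bounded between zero and one (if the limit exists). This limit can be made arbitrarily small, i.e., there exists an indexed family of probability mass functions $\{P_\theta\}_{\theta \in \Theta}$ for which the second term on the right-hand side of (2.240) can be made arbitrarily close to zero. For a concrete example, let $a \in (0, 1)$ be fixed, and let $\theta \in \mathbb{R}^+$ be a parameter that defines the following indexed family of probability mass functions over the ternary alphabet $\mathcal{X} = \{0, 1, 2\}$: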

^(0) = ^r— ^, Pe{l)=a, P e {2) 



1 + 9 

Then, it follows by calculus that for this indexed family 



lim 



max.^ 



In 



7VM 



D(P e \\P& 



1 + 



(l-a)0 



so, for any $\theta \in \mathbb{R}^+$, the above limit can be made arbitrarily close to zero by choosing $a$ close enough to 1. This completes the proof of (2.167), and also the proof of Proposition 2.



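2.D Proof of Lemma 8

In order to prove Lemma 8, one needs to show that if $\rho'(1) < \infty$ then

\[ \lim_{C \to 1} \sum_{i=1}^{\infty} (i+1)^2\, \Gamma_i \left[ h_2\!\left( \frac{1 - C^{i/2}}{2} \right) \right]^2 = 0 \tag{2.241} \]

which then yields from (2.189) that $B \to \infty$ in the limit where $C \to 1$.

By the assumption in Lemma 8 that $\rho'(1) < \infty$, we have $\sum_{i=1}^{\infty} i \rho_i < \infty$, and therefore it follows from the Cauchy-Schwarz inequality that

\[ \sum_{i=1}^{\infty} \frac{\rho_i}{i} \ge \frac{1}{\sum_{i=1}^{\infty} i \rho_i} > 0. \]

Hence, the average degree of the parity-check nodes is finite:

\[ d_{\mathrm{c}}^{\mathrm{avg}} = \frac{1}{\sum_{i=1}^{\infty} \frac{\rho_i}{i}} < \infty. \]

The infinite sum $\sum_{i=1}^{\infty} (i+1)^2 \Gamma_i$ converges under the above assumption since

\[ \sum_{i=1}^{\infty} (i+1)^2 \Gamma_i = \sum_{i=1}^{\infty} i^2 \Gamma_i + 2 \sum_{i=1}^{\infty} i \Gamma_i + \sum_{i=1}^{\infty} \Gamma_i = d_{\mathrm{c}}^{\mathrm{avg}} \left( \sum_{i=1}^{\infty} i \rho_i + 2 \right) + 1 < \infty \]

where the last equality holds since

\[ \Gamma_i = \frac{\rho_i / i}{\int_0^1 \rho(x)\, dx} = d_{\mathrm{c}}^{\mathrm{avg}} \left( \frac{\rho_i}{i} \right), \qquad \forall\, i \in \mathbb{N}. \]

The infinite series in (2.241) therefore converges uniformly for $C \in [0, 1]$; hence, the order of the limit and the infinite sum can be exchanged. Every term of the infinite series in (2.241) converges to zero in the limit where $C \to 1$, hence the limit in (2.241) is zero. This completes the proof of Lemma 8.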



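2.E Proof of the properties in (2.199) for OFDM signals

Consider an OFDM signal from Section 2.6.7. The sequence in (2.197) is a martingale due to basic properties of martingales. From (2.196), for every $i \in \{0, \ldots, n\}$

\[ Y_i = \mathbb{E}\Bigl[ \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{n-1}) \bigr| \,\Big|\, X_0, \ldots, X_{i-1} \Bigr]. \]

The conditional expectation for the RV $Y_{i-1}$ refers to the case where only $X_0, \ldots, X_{i-2}$ are revealed. Let $X'_{i-1}$ and $X_{i-1}$ be independent copies, which are also independent of $X_0, \ldots, X_{i-2}, X_i, \ldots, X_{n-1}$. Then, for every $1 \le i \le n$,

\begin{align*}
Y_{i-1} &= \mathbb{E}\Bigl[ \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1}) \bigr| \,\Big|\, X_0, \ldots, X_{i-2} \Bigr] \\
&= \mathbb{E}\Bigl[ \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1}) \bigr| \,\Big|\, X_0, \ldots, X_{i-2}, X_{i-1} \Bigr].
\end{align*}

Since $|\mathbb{E}(Z)| \le \mathbb{E}(|Z|)$, then for $i \in \{1, \ldots, n\}$

\[ |Y_i - Y_{i-1}| \le \mathbb{E}_{X'_{i-1}, X_i, \ldots, X_{n-1}} \bigl[ |U - V| \,\big|\, X_0, \ldots, X_{i-1} \bigr] \tag{2.242} \]

where

\begin{align*}
U &\triangleq \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{i-1}, X_i, \ldots, X_{n-1}) \bigr|, \\
V &\triangleq \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1}) \bigr|.
\end{align*}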
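From (2.194),

\begin{align*}
|U - V| &\le \max_{0 \le t \le T} \bigl| s(t; X_0, \ldots, X_{i-1}, X_i, \ldots, X_{n-1}) - s(t; X_0, \ldots, X'_{i-1}, X_i, \ldots, X_{n-1}) \bigr| \\
&= \max_{0 \le t \le T} \frac{1}{\sqrt{n}} \left| \bigl( X_{i-1} - X'_{i-1} \bigr) \exp\!\left( \frac{j\, 2\pi i t}{T} \right) \right| = \frac{|X_{i-1} - X'_{i-1}|}{\sqrt{n}}. \tag{2.243}
\end{align*}

By assumption, $|X_{i-1}| = |X'_{i-1}| = 1$, and therefore a.s.

\[ |X_{i-1} - X'_{i-1}| \le 2 \;\Longrightarrow\; |Y_i - Y_{i-1}| \le \frac{2}{\sqrt{n}}. \]

In the following, an upper bound on the conditional variance $\mathrm{Var}(Y_i \mid \mathcal{F}_{i-1}) = \mathbb{E}\bigl[ (Y_i - Y_{i-1})^2 \mid \mathcal{F}_{i-1} \bigr]$ is obtained. Since $(\mathbb{E}(Z))^2 \le \mathbb{E}(Z^2)$ for a real-valued RV $Z$, then from (2.242) and (2.243)

\[ \mathbb{E}\bigl[ (Y_i - Y_{i-1})^2 \mid \mathcal{F}_{i-1} \bigr] \le \frac{1}{n} \cdot \mathbb{E}_{X'_{i-1}} \bigl[ |X_{i-1} - X'_{i-1}|^2 \mid \mathcal{F}_i \bigr] \]

where $\mathcal{F}_i$ is the $\sigma$-algebra that is generated by $X_0, \ldots, X_{i-1}$. Due to the symmetry of the PSK constellation,

\begin{align*}
\mathbb{E}\bigl[ (Y_i - Y_{i-1})^2 \mid \mathcal{F}_{i-1} \bigr]
&\le \frac{1}{n}\, \mathbb{E}\bigl[ |X_{i-1} - X'_{i-1}|^2 \mid \mathcal{F}_i \bigr] \\
&= \frac{1}{n}\, \mathbb{E}\bigl[ |X_{i-1} - X'_{i-1}|^2 \mid X_0, \ldots, X_{i-1} \bigr] \\
&= \frac{1}{n}\, \mathbb{E}\bigl[ |X_{i-1} - X'_{i-1}|^2 \mid X_{i-1} \bigr] \\
&= \frac{1}{n}\, \mathbb{E}\bigl[ |X_{i-1} - X'_{i-1}|^2 \mid X_{i-1} = e^{\frac{j\pi}{M}} \bigr] \\
&= \frac{1}{nM} \sum_{l=0}^{M-1} \Bigl| e^{\frac{j\pi}{M}} - e^{\frac{j(2l+1)\pi}{M}} \Bigr|^2 \\
&= \frac{4}{nM} \sum_{l=1}^{M-1} \sin^2\!\left( \frac{\pi l}{M} \right) = \frac{2}{n}
\end{align*}

where the last equality holds since

\[ \sum_{l=1}^{M-1} \sin^2\!\left( \frac{\pi l}{M} \right) = \frac{1}{2} \sum_{l=0}^{M-1} \left( 1 - \cos\!\left( \frac{2\pi l}{M} \right) \right) = \frac{M}{2} - \frac{1}{2}\, \mathrm{Re}\left\{ \sum_{l=0}^{M-1} e^{\frac{j 2 l \pi}{M}} \right\} = \frac{M}{2}, \]

because the $M$-th roots of unity sum to zero.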
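
The two identities used at the end of this appendix are easy to check numerically. The Python sketch below is an illustration only; the function names and the choice of constellation sizes are assumptions, and the constellation points are taken as $e^{j(2l+1)\pi/M}$, $l = 0, \ldots, M-1$, as in the derivation above.

```python
import cmath
import math


def psk_points(M):
    """M-ary PSK points exp(j(2l+1)pi/M), l = 0, ..., M-1."""
    return [cmath.exp(1j * (2 * l + 1) * math.pi / M) for l in range(M)]


def mean_squared_distance(M):
    """E|X - X'|^2 with X fixed to the first point and X' uniform on the constellation."""
    pts = psk_points(M)
    x = pts[0]
    return sum(abs(x - y) ** 2 for y in pts) / M


def sine_square_sum(M):
    """Sum of sin^2(pi*l/M) over l = 1, ..., M-1."""
    return sum(math.sin(math.pi * l / M) ** 2 for l in range(1, M))


if __name__ == "__main__":
    for M in (2, 4, 8, 16):
        print(M, mean_squared_distance(M), sine_square_sum(M))
```

Dividing the mean squared distance by $n$ gives the bound $2/n$ on the conditional variance.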
Chapter 3



The Entropy Method, Log-Sobolev and 
Transportation-Cost Inequalities: Links 
and Applications in Information Theory 



This chapter introduces the entropy method for deriving concentration inequalities for functions of many 
independent random variables, and exhibits its multiple connections to information theory. The chapter 
is divided into four parts. The first part of the chapter introduces the basic ingredients of the entropy 
method and closely related topics, such as the logarithmic-Sobolev inequalities. These topics underlie 
the so-called functional approach to deriving concentration inequalities. The second part is devoted to 
a related viewpoint based on probability in metric spaces. This viewpoint centers around the so-called 
transportation-cost inequalities, which have been introduced into the study of concentration by Marton. 
The third part gives a brief summary of some results on concentration for dependent random variables, 
emphasizing the connections to information-theoretic ideas. The fourth part lists several applications of 
concentration inequalities and the entropy method to problems in information theory. The considered 
applications include strong converses for several source and channel coding problems, empirical distribu- 
tions of good channel codes with non-vanishing error probability, and an information-theoretic converse 
for concentration of measures. 



3.1 The main ingredients of the entropy method 

As a reminder, we are interested in the following question. Let $X_1, \ldots, X_n$ be $n$ independent random variables, each taking values in a set $\mathcal{X}$. Given a function $f : \mathcal{X}^n \to \mathbb{R}$, we would like to find tight upper bounds on the deviation probabilities for the random variable $U = f(X^n)$, i.e., we wish to bound from above the probability $P(|U - \mathbb{E}U| \ge r)$ for each $r > 0$. Of course, if $U$ has finite variance, then Chebyshev's inequality already gives

\[ P(|U - \mathbb{E}U| \ge r) \le \frac{\mathrm{var}(U)}{r^2}, \qquad \forall\, r > 0. \tag{3.1} \]

However, in many instances a bound like (3.1) is not nearly as tight as one would like, so ideally we aim for Gaussian-type bounds

\[ P(|U - \mathbb{E}U| \ge r) \le K \exp\left( -\kappa r^2 \right), \qquad \forall\, r > 0 \tag{3.2} \]

for some constants $K, \kappa > 0$. Whenever such a bound is available, $K$ is a small constant (usually, $K = 2$), while $\kappa$ depends on the sensitivity of the function $f$ to variations in its arguments.
In the preceding chapter, we have demonstrated the martingale method for deriving Gaussian concentration bounds of the form (3.2). In this chapter, our focus is on the so-called "entropy method," an information-theoretic technique that has become increasingly popular starting with the work of Ledoux [43] (see also [3]). In the following, we will always assume (unless specified otherwise) that the function $f : \mathcal{X}^n \to \mathbb{R}$ and the probability distribution $P$ of $X^n$ are such that

• $U = f(X^n)$ has zero mean: $\mathbb{E}U = \mathbb{E}f(X^n) = 0$

• $U$ is exponentially integrable:

\[ \mathbb{E}[\exp(\lambda U)] = \mathbb{E}\bigl[ \exp\bigl( \lambda f(X^n) \bigr) \bigr] < \infty, \qquad \forall\, \lambda \in \mathbb{R} \tag{3.3} \]

[another way of writing this is $\exp(\lambda f) \in L^1(P)$ for all $\lambda \in \mathbb{R}$].

In a nutshell, the entropy method has three basic ingredients:

1. The Chernoff bounding trick: using Markov's inequality, the problem of bounding the deviation probability $P(|U - \mathbb{E}U| \ge r)$ is reduced to the analysis of the logarithmic moment-generating function $\Lambda(\lambda) \triangleq \ln \mathbb{E}[\exp(\lambda U)]$, $\lambda \in \mathbb{R}$.

2. The Herbst argument: the function $\Lambda(\lambda)$ is related through a simple first-order differential equation to the relative entropy (information divergence) $D(P^{(\lambda f)} \| P)$, where $P = P_{X^n}$ is the probability distribution of $X^n$ and $P^{(\lambda f)}$ is the tilted probability distribution defined by

\[ \frac{dP^{(\lambda f)}}{dP} = \frac{\exp(\lambda f)}{\mathbb{E}[\exp(\lambda f)]} = \exp\bigl( \lambda f - \Lambda(\lambda) \bigr). \tag{3.4} \]

If the function $f$ and the probability distribution $P$ are such that

\[ D(P^{(\lambda f)} \| P) \le \frac{c \lambda^2}{2} \tag{3.5} \]

for some $c > 0$, then the Gaussian bound (3.2) holds with $K = 2$ and $\kappa = \frac{1}{2c}$. The standard way to establish (3.5) is through the so-called logarithmic Sobolev inequalities.

3. Tensorization of the entropy: with few exceptions, it is rather difficult to derive a bound like (3.5) directly. Instead, one typically takes a divide-and-conquer approach: using the fact that $P_{X^n}$ is a product distribution (by the assumed independence of the $X_i$'s), the divergence $D(P^{(\lambda f)} \| P)$ is bounded from above by a sum of "one-dimensional" (or "local") conditional divergence terms

\[ D\bigl( P^{(\lambda f)}_{X_i | \bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, P^{(\lambda f)}_{\bar{X}^i} \bigr), \qquad i = 1, \ldots, n \tag{3.6} \]

where, for each $i$, $\bar{X}^i \in \mathcal{X}^{n-1}$ denotes the $(n-1)$-tuple obtained from $X^n$ by removing the $i$th coordinate, i.e., $\bar{X}^i = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$. Despite their formidable appearance, the conditional divergences in (3.6) are easier to handle because, for each given realization $\bar{X}^i = \bar{x}^i$, the $i$th such term involves a single-variable function $f_i(\cdot | \bar{x}^i) : \mathcal{X} \to \mathbb{R}$ defined by $f_i(y | \bar{x}^i) \triangleq f(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_n)$ and the corresponding tilted distribution $P^{(\lambda f_i(\cdot | \bar{x}^i))}_{X_i}$, where

\[ \frac{dP^{(\lambda f_i(\cdot | \bar{x}^i))}_{X_i}}{dP_{X_i}} = \frac{\exp\bigl( \lambda f_i(\cdot | \bar{x}^i) \bigr)}{\mathbb{E}\bigl[ \exp\bigl( \lambda f_i(X_i | \bar{x}^i) \bigr) \bigr]}, \qquad \forall\, \bar{x}^i \in \mathcal{X}^{n-1}. \tag{3.7} \]
In fact, from (3.4) and (3.7), it is easy to see that the conditional distribution $P^{(\lambda f)}_{X_i | \bar{X}^i = \bar{x}^i}$ is nothing but the tilted distribution $P^{(\lambda f_i(\cdot | \bar{x}^i))}_{X_i}$. This simple observation translates into the following: If the function $f$ and the probability distribution $P = P_{X^n}$ are such that there exist constants $c_1, \ldots, c_n > 0$ so that

\[ D\bigl( P^{(\lambda f_i(\cdot | \bar{x}^i))}_{X_i} \,\big\|\, P_{X_i} \bigr) \le \frac{c_i \lambda^2}{2}, \qquad \forall\, i \in \{1, \ldots, n\},\; \bar{x}^i \in \mathcal{X}^{n-1}, \tag{3.8} \]

then (3.5) holds with $c = \sum_{i=1}^{n} c_i$ (to be shown explicitly later), which in turn gives that

\[ P\bigl( |f(X^n) - \mathbb{E}f(X^n)| \ge r \bigr) \le 2 \exp\left( -\frac{r^2}{2 \sum_{i=1}^{n} c_i} \right), \qquad r > 0. \tag{3.9} \]

Again, one would typically use logarithmic Sobolev inequalities to verify (3.8).

In the remainder of this section, we shall elaborate on these three ingredients. Logarithmic Sobolev inequalities and their applications to concentration bounds are described in detail in Sections 3.2 and 3.3.



3.1.1 The Chernoff bounding trick 

The first ingredient of the entropy method is the well-known Chernoff bounding trick.¹ Using Markov's inequality, for any $\lambda > 0$ we have

\begin{align*}
P(U \ge r) &= P\bigl( \exp(\lambda U) \ge \exp(\lambda r) \bigr) \\
&\le \exp(-\lambda r)\, \mathbb{E}[\exp(\lambda U)].
\end{align*}

Equivalently, if we define the logarithmic moment-generating function $\Lambda(\lambda) \triangleq \ln \mathbb{E}[\exp(\lambda U)]$, $\lambda \in \mathbb{R}$, we can write

\[ P(U \ge r) \le \exp\bigl( \Lambda(\lambda) - \lambda r \bigr), \qquad \forall\, \lambda > 0. \tag{3.10} \]

To bound the probability of the lower tail, $P(U \le -r)$, we follow the same steps, but with $-U$ instead of $U$. From now on, we will focus on the deviation probability $P(U \ge r)$.

By means of the Chernoff bounding trick, we have reduced the problem of bounding the deviation probability $P(U \ge r)$ to the analysis of the logarithmic moment-generating function $\Lambda(\lambda)$. The following properties of $\Lambda(\lambda)$ will be useful later on:

• $\Lambda(0) = 0$

• Because of the exponential integrability of $U$ [cf. (3.3)], $\Lambda(\lambda)$ is infinitely differentiable, and one can interchange derivative and expectation. In particular,

\[ \Lambda'(\lambda) = \frac{\mathbb{E}[U \exp(\lambda U)]}{\mathbb{E}[\exp(\lambda U)]} \quad \text{and} \quad \Lambda''(\lambda) = \frac{\mathbb{E}[U^2 \exp(\lambda U)]}{\mathbb{E}[\exp(\lambda U)]} - \left( \frac{\mathbb{E}[U \exp(\lambda U)]}{\mathbb{E}[\exp(\lambda U)]} \right)^2. \tag{3.11} \]

Since we have assumed that $\mathbb{E}U = 0$, we have $\Lambda'(0) = 0$ and $\Lambda''(0) = \mathrm{var}(U)$.

• Since $\Lambda(0) = \Lambda'(0) = 0$, we get

\[ \lim_{\lambda \to 0} \frac{\Lambda(\lambda)}{\lambda} = 0. \tag{3.12} \]

¹ The name of H. Chernoff is associated with this technique because of his 1952 paper [120]; however, its roots go back to S.N. Bernstein's 1927 textbook on the theory of probability [121].
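
To make the bound (3.10) concrete, the following Python sketch evaluates it for a sum of $n$ independent, equiprobable $\pm 1$ signs, for which $\Lambda(\lambda) = n \ln\cosh\lambda$. The example, the function names and the grid search over $\lambda$ are assumptions chosen here for illustration; they are not taken from the text.

```python
import math


def log_mgf(lam, n):
    """Lambda(lam) = ln E[exp(lam*U)] for U = sum of n independent +/-1 signs."""
    return n * math.log(math.cosh(lam))


def chernoff_bound(r, n):
    """Minimize exp(Lambda(lam) - lam*r) over a grid of lam in (0, 5]."""
    lams = [k / 1000 for k in range(1, 5001)]
    return min(math.exp(log_mgf(lam, n) - lam * r) for lam in lams)


def exact_tail(r, n):
    """Exact P(U >= r), using U = 2B - n with B ~ Binomial(n, 1/2)."""
    count = sum(math.comb(n, k) for k in range(n + 1) if 2 * k - n >= r)
    return count / 2 ** n


if __name__ == "__main__":
    n = 20
    for r in (4, 8, 12):
        print(r, exact_tail(r, n), chernoff_bound(r, n), math.exp(-r * r / (2 * n)))
```

For each $r$, the exact tail is below the optimized Chernoff bound, which in turn is below the Gaussian-type bound $\exp(-r^2/(2n))$, since $\ln\cosh\lambda \le \lambda^2/2$.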
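3.1.2 The Herbst argument

The second ingredient of the entropy method consists in relating the function $\Lambda(\lambda)$ to a certain relative entropy, and is often referred to as the Herbst argument because the basic idea underlying it had been described in an unpublished note by I. Herbst.

Given any function $g : \mathcal{X}^n \to \mathbb{R}$ which is exponentially integrable w.r.t. $P$, i.e., $\mathbb{E}[\exp(g(X^n))] < \infty$, let us denote by $P^{(g)}$ the $g$-tilting of $P$:

\[ \frac{dP^{(g)}}{dP} = \frac{\exp(g)}{\mathbb{E}[\exp(g)]}. \]

Then

\begin{align*}
D(P^{(g)} \| P) &= \int_{\mathcal{X}^n} \ln\left( \frac{dP^{(g)}}{dP} \right) dP^{(g)} \\
&= \int_{\mathcal{X}^n} \frac{dP^{(g)}}{dP} \ln\left( \frac{dP^{(g)}}{dP} \right) dP \\
&= \int_{\mathcal{X}^n} \frac{\exp(g)}{\mathbb{E}[\exp(g)]} \bigl( g - \ln \mathbb{E}[\exp(g)] \bigr)\, dP \\
&= \frac{1}{\mathbb{E}[\exp(g)]} \int_{\mathcal{X}^n} g \exp(g)\, dP - \ln \mathbb{E}[\exp(g)] \\
&= \frac{\mathbb{E}[g \exp(g)]}{\mathbb{E}[\exp(g)]} - \ln \mathbb{E}[\exp(g)].
\end{align*}

In particular, if we let $g = tf$ for some $t \ne 0$, then

\begin{align*}
D(P^{(tf)} \| P) &= \frac{t\, \mathbb{E}[f \exp(tf)]}{\mathbb{E}[\exp(tf)]} - \ln \mathbb{E}[\exp(tf)] \\
&= t \Lambda'(t) - \Lambda(t) \\
&= t^2 \left( \frac{\Lambda'(t)}{t} - \frac{\Lambda(t)}{t^2} \right) = t^2\, \frac{d}{dt} \left( \frac{\Lambda(t)}{t} \right), \tag{3.13}
\end{align*}

where in the second line we have used (3.11). Integrating from $t = 0$ to $t = \lambda$ and using (3.12), we get

\[ \Lambda(\lambda) = \lambda \int_0^{\lambda} \frac{D(P^{(tf)} \| P)}{t^2}\, dt. \tag{3.14} \]

Combining (3.14) with (3.10), we have proved the following:

Proposition 4. Let $U = f(X^n)$ be a zero-mean random variable that is exponentially integrable. Then, for any $r \ge 0$,

\[ P(U \ge r) \le \exp\left( \lambda \int_0^{\lambda} \frac{D(P^{(tf)} \| P)}{t^2}\, dt - \lambda r \right), \qquad \forall\, \lambda > 0. \tag{3.15} \]

Thus, we have reduced the problem of bounding the deviation probabilities $P(U \ge r)$ to the problem of bounding the relative entropies $D(P^{(tf)} \| P)$. In particular, we have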
Corollary 7. Suppose that the function $f$ and the probability distribution $P$ of $X^n$ are such that

\[ D(P^{(tf)} \| P) \le \frac{c t^2}{2}, \qquad \forall\, t > 0 \tag{3.16} \]

for some constant $c > 0$. Then,

\[ P(U \ge r) \le \exp\left( -\frac{r^2}{2c} \right), \qquad \forall\, r \ge 0. \tag{3.17} \]

Proof. Using (3.16) to upper-bound the integrand on the right-hand side of (3.15), we get

\[ P(U \ge r) \le \exp\left( \frac{c \lambda^2}{2} - \lambda r \right), \qquad \forall\, \lambda > 0. \tag{3.18} \]

Optimizing over $\lambda > 0$ to get the tightest bound gives $\lambda = \frac{r}{c}$, and its substitution in (3.18) gives the bound in (3.17). □
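
The identities behind the Herbst argument can be verified numerically on a small finite alphabet. The Python sketch below is an illustration under assumed data (a three-point distribution `p` and a zero-mean function `f`, both hypothetical); it checks that $D(P^{(tf)} \| P) = t\Lambda'(t) - \Lambda(t)$ and evaluates the integral representation of $\Lambda(\lambda)$ with a midpoint rule.

```python
import math


def tilt(p, f, t):
    """Tilted pmf: P^{(tf)}(x) proportional to p(x) * exp(t * f(x))."""
    w = [pi * math.exp(t * fi) for pi, fi in zip(p, f)]
    z = sum(w)
    return [wi / z for wi in w]


def kl(q, p):
    """Relative entropy D(q||p) in nats for pmfs on a common finite alphabet."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def log_mgf(p, f, t):
    """Lambda(t) = ln E[exp(t f)]."""
    return math.log(sum(pi * math.exp(t * fi) for pi, fi in zip(p, f)))


def log_mgf_derivative(p, f, t):
    """Lambda'(t) = E[f exp(tf)] / E[exp(tf)], i.e. the mean of f under the tilted pmf."""
    return sum(qi * fi for qi, fi in zip(tilt(p, f, t), f))


def herbst_integral(p, f, lam, steps=2000):
    """lam * integral_0^lam D(P^{(tf)}||P) / t^2 dt, by the midpoint rule."""
    h = lam / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += kl(tilt(p, f, t), p) / (t * t)
    return lam * total * h


if __name__ == "__main__":
    p = [0.2, 0.5, 0.3]    # assumed pmf
    f = [1.0, -1.0, 1.0]   # assumed function with zero mean under p
    t = 0.7
    print(kl(tilt(p, f, t), p), t * log_mgf_derivative(p, f, t) - log_mgf(p, f, t))
    print(herbst_integral(p, f, 1.0), log_mgf(p, f, 1.0))
```

Each printed pair should agree up to numerical error, the second pair only up to the accuracy of the midpoint rule.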

3.1.3 Tensorization of the (relative) entropy 

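The relative entropy $D(P^{(tf)} \| P)$ involves two probability measures on the Cartesian product space $\mathcal{X}^n$, so bounding this quantity directly is generally very difficult. This is where the third ingredient of the entropy method, the so-called tensorization step, comes in. The name "tensorization" reflects the fact that this step involves bounding $D(P^{(tf)} \| P)$ by a sum of "one-dimensional" relative entropy terms, each involving the conditional distributions of one of the variables given the rest. The tensorization step hinges on the following simple bound:

Proposition 5. Let $P$ and $Q$ be two probability measures on the product space $\mathcal{X}^n$, where $P$ is a product measure. For any $i \in \{1, \ldots, n\}$, let $\bar{X}^i$ denote the $(n-1)$-tuple $(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$ obtained by removing $X_i$ from $X^n$. Then

\[ D(Q \| P) \le \sum_{i=1}^{n} D\bigl( Q_{X_i | \bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, Q_{\bar{X}^i} \bigr). \tag{3.19} \]

Proof. From the relative entropy chain rule,

\begin{align*}
D(Q \| P) &= \sum_{i=1}^{n} D\bigl( Q_{X_i | X^{i-1}} \,\big\|\, P_{X_i | X^{i-1}} \,\big|\, Q_{X^{i-1}} \bigr) \\
&= \sum_{i=1}^{n} D\bigl( Q_{X_i | X^{i-1}} \,\big\|\, P_{X_i} \,\big|\, Q_{X^{i-1}} \bigr) \tag{3.20}
\end{align*}

where the last equality holds since $X_1, \ldots, X_n$ are independent random variables under $P$ (which implies that $P_{X_i | X^{i-1}} = P_{X_i | \bar{X}^i} = P_{X_i}$). Furthermore, for every $i \in \{1, \ldots, n\}$,

\begin{align*}
& D\bigl( Q_{X_i | \bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, Q_{\bar{X}^i} \bigr) - D\bigl( Q_{X_i | X^{i-1}} \,\big\|\, P_{X_i} \,\big|\, Q_{X^{i-1}} \bigr) \\
&\quad = \mathbb{E}_Q \left[ \ln \frac{dQ_{X_i | \bar{X}^i}}{dP_{X_i}} \right] - \mathbb{E}_Q \left[ \ln \frac{dQ_{X_i | X^{i-1}}}{dP_{X_i}} \right] \\
&\quad = \mathbb{E}_Q \left[ \ln \frac{dQ_{X_i | \bar{X}^i}}{dQ_{X_i | X^{i-1}}} \right] \\
&\quad = D\bigl( Q_{X_i | \bar{X}^i} \,\big\|\, Q_{X_i | X^{i-1}} \,\big|\, Q_{\bar{X}^i} \bigr) \ge 0. \tag{3.21}
\end{align*}

Hence, by combining (3.20) and (3.21), we get the inequality in (3.19). □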
Remark 19. The quantity on the right-hand side of (3.19) is actually the so-called erasure divergence $D^-(Q \| P)$ between $Q$ and $P$ (see [122, Definition 4]), which in the case of arbitrary $Q$ and $P$ is defined by

\[ D^-(Q \| P) \triangleq \sum_{i=1}^{n} D\bigl( Q_{X_i | \bar{X}^i} \,\big\|\, P_{X_i | \bar{X}^i} \,\big|\, Q_{\bar{X}^i} \bigr). \tag{3.22} \]

Because in the inequality (3.19) $P$ is assumed to be a product measure, we can replace $P_{X_i | \bar{X}^i}$ by $P_{X_i}$. For a general (non-product) measure $P$, the erasure divergence $D^-(Q \| P)$ may be strictly larger or smaller than the ordinary divergence $D(Q \| P)$. For example, if $n = 2$, $P_{X_1} = Q_{X_1}$, $P_{X_2} = Q_{X_2}$, then

\[ \frac{dQ_{X_1 | X_2}}{dP_{X_1 | X_2}} = \frac{dQ_{X_2 | X_1}}{dP_{X_2 | X_1}} = \frac{dQ_{X_1, X_2}}{dP_{X_1, X_2}}, \]

so, from (3.22),

\[ D^-(Q_{X_1, X_2} \| P_{X_1, X_2}) = D\bigl( Q_{X_1 | X_2} \| P_{X_1 | X_2} \,\big|\, Q_{X_2} \bigr) + D\bigl( Q_{X_2 | X_1} \| P_{X_2 | X_1} \,\big|\, Q_{X_1} \bigr) = 2 D(Q_{X_1, X_2} \| P_{X_1, X_2}). \]

On the other hand, if $X_1 = X_2$ under both $P$ and $Q$, then $D^-(Q \| P) = 0$, but $D(Q \| P) > 0$ whenever $P \ne Q$, so $D(Q \| P) > D^-(Q \| P)$ in this case.
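
The tensorization bound (3.19) and the factor-of-two example of Remark 19 can be checked numerically for $n = 2$. The Python sketch below is an illustration only: the joint distributions are assumed sample values, $P$ is taken to be a product of two pmfs `p1` and `p2`, and the function names are hypothetical.

```python
import math


def kl(q, p):
    """Relative entropy D(q||p) in nats for pmfs on a common finite alphabet."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def joint_kl(Q, p1, p2):
    """D(Q || p1 x p2) for a joint pmf given as a nested list Q[a][b]."""
    return sum(
        Q[a][b] * math.log(Q[a][b] / (p1[a] * p2[b]))
        for a in range(len(p1))
        for b in range(len(p2))
        if Q[a][b] > 0
    )


def tensorized_bound(Q, p1, p2):
    """Right-hand side of (3.19) for n = 2 and the product measure p1 x p2."""
    total = 0.0
    for b in range(len(p2)):   # i = 1: average over X_2 = b of D(Q_{X1|X2=b} || p1)
        qb = sum(Q[a][b] for a in range(len(p1)))
        if qb > 0:
            total += qb * kl([Q[a][b] / qb for a in range(len(p1))], p1)
    for a in range(len(p1)):   # i = 2: average over X_1 = a of D(Q_{X2|X1=a} || p2)
        qa = sum(Q[a])
        if qa > 0:
            total += qa * kl([Q[a][b] / qa for b in range(len(p2))], p2)
    return total


if __name__ == "__main__":
    uniform = [0.5, 0.5]
    Q_same_marginals = [[0.4, 0.1], [0.1, 0.4]]   # marginals equal to those of P
    Q_other = [[0.5, 0.2], [0.2, 0.1]]
    for Q in (Q_same_marginals, Q_other):
        print(joint_kl(Q, uniform, uniform), tensorized_bound(Q, uniform, uniform))
```

For the first joint distribution, whose marginals coincide with those of $P$, the right-hand side equals twice the divergence; for the second one the inequality is nearly tight.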

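Applying Proposition 5 with $Q = P^{(tf)}$ to bound the divergence in the integrand in (3.15), we obtain from Corollary 7 the following:

Proposition 6. For any $r \ge 0$, we have

\[ P(U \ge r) \le \exp\left( \lambda \sum_{i=1}^{n} \int_0^{\lambda} \frac{D\bigl( P^{(tf)}_{X_i | \bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, P^{(tf)}_{\bar{X}^i} \bigr)}{t^2}\, dt - \lambda r \right), \qquad \forall\, \lambda > 0. \tag{3.23} \]

The conditional divergences in the integrand in (3.23) may look formidable, but the remarkable thing is that, for each $i$ and a given $\bar{X}^i = \bar{x}^i$, the corresponding term involves a tilting of the marginal distribution $P_{X_i}$. Indeed, let us fix some $i \in \{1, \ldots, n\}$, and for each choice of $\bar{x}^i \in \mathcal{X}^{n-1}$ let us define a function $f_i(\cdot | \bar{x}^i) : \mathcal{X} \to \mathbb{R}$ by setting

\[ f_i(y | \bar{x}^i) \triangleq f(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_n), \qquad \forall\, y \in \mathcal{X}. \tag{3.24} \]

Then

\[ \frac{dP^{(f)}_{X_i | \bar{X}^i = \bar{x}^i}}{dP_{X_i}} = \frac{\exp\bigl( f_i(\cdot | \bar{x}^i) \bigr)}{\mathbb{E}\bigl[ \exp\bigl( f_i(X_i | \bar{x}^i) \bigr) \bigr]}. \tag{3.25} \]

In other words, $P^{(f)}_{X_i | \bar{X}^i = \bar{x}^i}$ is the $f_i(\cdot | \bar{x}^i)$-tilting of $P_{X_i}$. This is the essence of tensorization: we have effectively decomposed the $n$-dimensional problem of bounding $D(P^{(tf)} \| P)$ into $n$ one-dimensional problems, where the $i$th problem involves the tilting of the marginal distribution $P_{X_i}$ by functions of the form $f_i(\cdot | \bar{x}^i)$, $\forall\, \bar{x}^i$. In particular, we get the following:

Corollary 8. Suppose that the function $f$ and the probability distribution $P$ of $X^n$ are such that there exist some constants $c_1, \ldots, c_n > 0$, so that, for any $t > 0$,

\[ D\bigl( P^{(t f_i(\cdot | \bar{x}^i))}_{X_i} \,\big\|\, P_{X_i} \bigr) \le \frac{c_i t^2}{2}, \qquad \forall\, i \in \{1, \ldots, n\},\; \bar{x}^i \in \mathcal{X}^{n-1}. \tag{3.26} \]

Then

\[ P\bigl( f(X^n) - \mathbb{E}f(X^n) \ge r \bigr) \le \exp\left( -\frac{r^2}{2 \sum_{i=1}^{n} c_i} \right), \qquad \forall\, r > 0. \tag{3.27} \]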
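Proof. For any $t > 0$,

\begin{align*}
D(P^{(tf)} \| P)
&\le \sum_{i=1}^{n} D\bigl( P^{(tf)}_{X_i | \bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, P^{(tf)}_{\bar{X}^i} \bigr) \tag{3.28} \\
&= \sum_{i=1}^{n} \int_{\mathcal{X}^{n-1}} D\bigl( P^{(tf)}_{X_i | \bar{X}^i = \bar{x}^i} \,\big\|\, P_{X_i} \bigr)\, P^{(tf)}_{\bar{X}^i}(d\bar{x}^i) \tag{3.29} \\
&= \sum_{i=1}^{n} \int_{\mathcal{X}^{n-1}} D\bigl( P^{(t f_i(\cdot | \bar{x}^i))}_{X_i} \,\big\|\, P_{X_i} \bigr)\, P^{(tf)}_{\bar{X}^i}(d\bar{x}^i) \tag{3.30} \\
&\le \frac{t^2}{2} \sum_{i=1}^{n} c_i \int_{\mathcal{X}^{n-1}} P^{(tf)}_{\bar{X}^i}(d\bar{x}^i) \tag{3.31} \\
&= \frac{t^2}{2} \sum_{i=1}^{n} c_i \tag{3.32}
\end{align*}

where (3.28) follows from the tensorization of the relative entropy, (3.29) holds since $P$ is a product measure (so $P_{X_i} = P_{X_i | \bar{X}^i}$) and by the definition of the conditional relative entropy, (3.30) follows from (3.24) and (3.25), which imply that $P^{(tf)}_{X_i | \bar{X}^i = \bar{x}^i} = P^{(t f_i(\cdot | \bar{x}^i))}_{X_i}$, and inequality (3.31) holds by the assumption in (3.26). Finally, the inequality in (3.27) follows from (3.32) and Corollary 7. □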

3.1.4 Preview: logarithmic Sobolev inequalities 

Ultimately, the success of the entropy method hinges on demonstrating that the bounds in (3.26) hold for the function $f : \mathcal{X}^n \to \mathbb{R}$ and the probability distribution $P = P_{X^n}$ of interest. In the next two sections, we will show how to derive such bounds using the so-called logarithmic Sobolev inequalities. Here, we will give a quick preview of this technique.

Let $\mu$ be a probability measure on $\mathcal{X}$, and let $\mathcal{A}$ be a family of real-valued functions $g : \mathcal{X} \to \mathbb{R}$, such that for any $a \ge 0$ and $g \in \mathcal{A}$, also $ag \in \mathcal{A}$. Let $E : \mathcal{A} \to \mathbb{R}^+$ be a non-negative functional that is homogeneous of degree 2, i.e., for any $a \ge 0$ and $g \in \mathcal{A}$, we have $E(ag) = a^2 E(g)$. Suppose further that there exists a constant $c > 0$, such that the inequality

\[ D(\mu^{(g)} \| \mu) \le \frac{c\, E(g)}{2} \tag{3.33} \]

holds for any $g \in \mathcal{A}$. Now, suppose that, for each $i \in \{1, \ldots, n\}$, inequality (3.33) holds with $\mu = P_{X_i}$ and some constant $c_i > 0$, where $\mathcal{A}$ is a suitable family of functions $f$ such that, for any $\bar{x}^i \in \mathcal{X}^{n-1}$ and $i \in \{1, \ldots, n\}$,

1. $f_i(\cdot | \bar{x}^i) \in \mathcal{A}$

2. $E\bigl( f_i(\cdot | \bar{x}^i) \bigr) \le 1$

where $f_i$ is defined in (3.24). Then, the bounds in (3.26) hold since from (3.33) and the above properties of the functional $E$, it follows that for every $t > 0$ and $\bar{x}^i \in \mathcal{X}^{n-1}$

\begin{align*}
D\bigl( P^{(tf)}_{X_i | \bar{X}^i = \bar{x}^i} \,\big\|\, P_{X_i} \bigr)
&\le \frac{c_i\, E\bigl( t f_i(\cdot | \bar{x}^i) \bigr)}{2} \\
&= \frac{c_i t^2\, E\bigl( f_i(\cdot | \bar{x}^i) \bigr)}{2} \\
&\le \frac{c_i t^2}{2}, \qquad \forall\, i \in \{1, \ldots, n\}.
\end{align*}

Consequently, the Gaussian concentration inequality in (3.27) follows from Corollary 8.



3.2 The Gaussian logarithmic Sobolev inequality (LSI) 



Before turning to the general scheme of logarithmic Sobolev inequalities in the next section, we will illustrate the basic ideas in the particular case when $X_1, \ldots, X_n$ are i.i.d. standard Gaussian random variables. The relevant log-Sobolev inequality in this instance comes from a seminal paper of Gross [44], and it connects two key information-theoretic measures, namely the relative entropy and the relative Fisher information. In addition, there are deep links between Gross's log-Sobolev inequality and other fundamental information-theoretic inequalities, such as Stam's inequality and the entropy power inequality. Some of these fundamental links are considered in this section.

For any $n \in \mathbb{N}$ and any positive-semidefinite matrix $K \in \mathbb{R}^{n \times n}$, we will denote by $G^n_K$ the Gaussian distribution with zero mean and covariance matrix $K$. When $K = s I_n$ for some $s \ge 0$ (where $I_n$ denotes the $n \times n$ identity matrix), we will write $G^n_s$. We will also write $G^n$ for $G^n_1$ when $n \ge 2$, and $G$ for $G^1_1$. We will denote by $\gamma^n_K$, $\gamma^n_s$, $\gamma_s$, and $\gamma$ the corresponding densities.

We first state Gross's inequality in its (more or less) original form: 



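Theorem 21. For $Z \sim G^n$ and for any smooth function $\phi : \mathbb{R}^n \to \mathbb{R}$, we have

\[ \mathbb{E}\bigl[ \phi^2(Z) \ln \phi^2(Z) \bigr] - \mathbb{E}[\phi^2(Z)] \ln \mathbb{E}[\phi^2(Z)] \le 2\, \mathbb{E}\bigl[ \| \nabla \phi(Z) \|^2 \bigr]. \tag{3.34} \]

Remark 20. As shown by Carlen [123], equality in (3.34) holds if and only if $\phi$ is of the form $\phi(z) = \exp \langle a, z \rangle$ for some $a \in \mathbb{R}^n$, where $\langle \cdot, \cdot \rangle$ denotes the standard Euclidean inner product.

Remark 21. There is no loss of generality in assuming that $\mathbb{E}[\phi^2(Z)] = 1$. Then (3.34) can be rewritten as

\[ \mathbb{E}\bigl[ \phi^2(Z) \ln \phi^2(Z) \bigr] \le 2\, \mathbb{E}\bigl[ \| \nabla \phi(Z) \|^2 \bigr], \qquad \text{if } \mathbb{E}[\phi^2(Z)] = 1,\; Z \sim G^n. \tag{3.35} \]

Moreover, a simple rescaling argument shows that, for $Z \sim G^n_s$ and an arbitrary smooth function $\phi$ with $\mathbb{E}[\phi^2(Z)] = 1$,

\[ \mathbb{E}\bigl[ \phi^2(Z) \ln \phi^2(Z) \bigr] \le 2s\, \mathbb{E}\bigl[ \| \nabla \phi(Z) \|^2 \bigr]. \tag{3.36} \]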
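
A one-dimensional numerical check of (3.34) may help fix ideas. The Python sketch below is an illustration under stated assumptions: the Gaussian expectations are approximated by a trapezoidal rule on $[-12, 12]$, the two test functions ($\phi(z) = \exp(z/2)$, an equality case per Remark 20, and $\phi(z) = 1 + \tfrac{1}{2}\sin z$) are choices made here, and the function names are hypothetical.

```python
import math


def gauss_expect(h, lo=-12.0, hi=12.0, steps=24000):
    """Approximate E[h(Z)] for Z ~ N(0,1) with the trapezoidal rule on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        z = lo + k * dx
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * h(z) * math.exp(-z * z / 2)
    return total * dx / math.sqrt(2 * math.pi)


def lsi_sides(phi, dphi):
    """Return (left-hand side, right-hand side) of (3.34) for n = 1.

    phi must be nonzero on the integration range; dphi is its derivative.
    """
    mean_sq = gauss_expect(lambda z: phi(z) ** 2)
    entropy = gauss_expect(lambda z: phi(z) ** 2 * math.log(phi(z) ** 2))
    entropy -= mean_sq * math.log(mean_sq)
    energy = 2 * gauss_expect(lambda z: dphi(z) ** 2)
    return entropy, energy


if __name__ == "__main__":
    a = 0.5
    print(lsi_sides(lambda z: math.exp(a * z), lambda z: a * math.exp(a * z)))
    print(lsi_sides(lambda z: 1 + 0.5 * math.sin(z), lambda z: 0.5 * math.cos(z)))
```

For the exponential test function both sides coincide (up to quadrature error) at $2a^2 e^{2a^2}$, while for the second function the left-hand side is strictly smaller.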



An information-theoretic proof of the Gaussian LSI (Theorem 21) is provided in the continuation to this section. The reader is also referred to [124] for another proof that is not information-theoretic.

From an information-theoretic point of view, the Gaussian LSI (3.34) relates two measures of (dis)similarity between probability measures: the relative entropy (or divergence) and the relative Fisher information (or Fisher information distance). The latter is defined as follows. Let $P_1$ and $P_2$ be two Borel probability measures on $\mathbb{R}^n$ with differentiable densities $p_1$ and $p_2$. Then the relative Fisher information (or Fisher information distance) between $P_1$ and $P_2$ is defined as (see [125, Eq. (6.4.12)])

\[ I(P_1 \| P_2) \triangleq \int_{\mathbb{R}^n} \left\| \nabla \ln \frac{p_1(z)}{p_2(z)} \right\|^2 p_1(z)\, dz = \mathbb{E}_{P_1} \left[ \left\| \nabla \ln \frac{dP_1}{dP_2} \right\|^2 \right] \tag{3.37} \]

whenever the above integral converges. Under suitable regularity conditions, $I(P_1 \| P_2)$ admits the equivalent form (see [126, Eq. (1.108)])

\[ I(P_1 \| P_2) = 4 \int_{\mathbb{R}^n} p_2(z) \left\| \nabla \sqrt{\frac{p_1(z)}{p_2(z)}} \right\|^2 dz = 4\, \mathbb{E}_{P_2} \left[ \left\| \nabla \sqrt{\frac{dP_1}{dP_2}} \right\|^2 \right]. \tag{3.38} \]
Remark 22. One condition under which (3.38) holds is as follows. Let $\xi : \mathbb{R}^n \to \mathbb{R}^n$ be the distributional (or weak) gradient of $\sqrt{dP_1 / dP_2} = \sqrt{p_1 / p_2}$, i.e., the equality

\[ \int_{\mathbb{R}^n} \sqrt{\frac{p_1(z)}{p_2(z)}}\; \partial_i \psi(z)\, dz = -\int_{\mathbb{R}^n} \xi_i(z)\, \psi(z)\, dz \]

holds for all $i = 1, \ldots, n$ and all test functions $\psi \in C_c^{\infty}(\mathbb{R}^n)$ [127, Sec. 6.6]. Then (3.38) holds, provided $\xi \in L^2(P_2)$.

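Now let us fix a smooth function $\phi : \mathbb{R}^n \to \mathbb{R}$ satisfying the normalization condition $\int_{\mathbb{R}^n} \phi^2\, dG^n = 1$; we can assume w.l.o.g. that $\phi \ge 0$. Let $Z$ be a standard $n$-dimensional Gaussian random variable, i.e., $P_Z = G^n$, and let $Y \in \mathbb{R}^n$ be a random vector with distribution $P_Y$ satisfying

\[ \frac{dP_Y}{dP_Z} = \frac{dP_Y}{dG^n} = \phi^2. \]

Then, on the one hand, we have

\[ \mathbb{E}\bigl[ \phi^2(Z) \ln \phi^2(Z) \bigr] = \mathbb{E}\left[ \left( \frac{dP_Y}{dP_Z}(Z) \right) \ln \left( \frac{dP_Y}{dP_Z}(Z) \right) \right] = D(P_Y \| P_Z), \tag{3.39} \]

and on the other, from (3.38),

\[ \mathbb{E}\bigl[ \| \nabla \phi(Z) \|^2 \bigr] = \mathbb{E}\left[ \left\| \nabla \sqrt{\frac{dP_Y}{dP_Z}(Z)} \right\|^2 \right] = \frac{1}{4}\, I(P_Y \| P_Z). \tag{3.40} \]

Substituting (3.39) and (3.40) into (3.35), we obtain the inequality

\[ D(P_Y \| P_Z) \le \frac{1}{2}\, I(P_Y \| P_Z), \qquad P_Z = G^n \tag{3.41} \]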

which holds for any $P_Y \ll G^n$ with $\nabla\sqrt{dP_Y/dG^n} \in L^2(G^n)$. Conversely, for any $P_Y \ll G^n$ satisfying (3.41), we can derive (3.35) by letting $\phi = \sqrt{dP_Y/dG^n}$, provided $\nabla\phi$ exists (e.g., in the distributional sense). Similarly, for any $s > 0$, (3.36) can be written as

$$D(P_Y\|P_Z) \le \frac{s}{2}\, I(P_Y\|P_Z), \qquad P_Z = G_s^n. \tag{3.42}$$

Now let us apply the Gaussian LSI (3.34) to functions of the form $\phi = \exp(g/2)$ for all suitably well-behaved $g : \mathbb{R}^n \to \mathbb{R}$. Doing this, we obtain

$$\mathbb{E}\left[ \exp(g) \ln \frac{\exp(g)}{\mathbb{E}[\exp(g)]} \right] \le \frac{1}{2}\, \mathbb{E}\left[ \|\nabla g\|^2 \exp(g) \right], \tag{3.43}$$



where the expectation is w.r.t. $G^n$. If we let $P = G^n$, then we can recognize the left-hand side of (3.43) as $\mathbb{E}[\exp(g)] \cdot D(P^{(g)}\|P)$, where $P^{(g)}$ denotes, as usual, the $g$-tilting of $P$. Moreover, the right-hand side is equal to $\tfrac{1}{2}\,\mathbb{E}[\exp(g)] \cdot \mathbb{E}_P^{(g)}[\|\nabla g\|^2]$, with $\mathbb{E}_P^{(g)}[\cdot]$ denoting expectation w.r.t. $P^{(g)}$. We therefore obtain the so-called modified log-Sobolev inequality for the standard Gaussian measure:

$$D(P^{(g)}\|P) \le \frac{1}{2}\, \mathbb{E}_P^{(g)}\left[\|\nabla g\|^2\right] \tag{3.44}$$

which holds for all smooth functions $g : \mathbb{R}^n \to \mathbb{R}$ that are exponentially integrable w.r.t. $G^n$. Observe that (3.44) implies (3.33) with $\mu = G^n$, $c = 1$, and $E(g) = \big\| \|\nabla g\|^2 \big\|_\infty$.
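
As a sanity check on the divergence form $D(P_Y\|G) \le \tfrac{1}{2} I(P_Y\|G)$ of the Gaussian LSI in dimension one, the following minimal sketch evaluates both sides in closed form for $P_Y = \mathcal{N}(m, \sigma^2)$. The function names are illustrative and not part of the text; the closed-form expressions are standard Gaussian computations, with the score of $Y$ taken to be $-(y-m)/\sigma^2$.

```python
import math

def kl_gaussian_vs_standard(m: float, var: float) -> float:
    """D(N(m, var) || N(0, 1)) in nats."""
    return 0.5 * (var + m * m - 1.0 - math.log(var))

def relative_fisher_vs_standard(m: float, var: float) -> float:
    """I(N(m, var) || N(0, 1)) = E[(score_Y(Y) + Y)^2].

    Here score_Y(y) = -(y - m) / var and the score of N(0, 1) is -y,
    so the integrand is ((1 - 1/var) * (Y - m) + m)^2.
    """
    return (1.0 - 1.0 / var) ** 2 * var + m * m

def lsi_gap(m: float, var: float) -> float:
    """(1/2) I(P_Y || G) - D(P_Y || G); nonnegative if the LSI holds."""
    return 0.5 * relative_fisher_vs_standard(m, var) - kl_gaussian_vs_standard(m, var)

if __name__ == "__main__":
    for m, var in [(0.0, 0.25), (1.0, 1.0), (-2.0, 4.0)]:
        print(m, var, lsi_gap(m, var))
```

In this family the gap equals $\tfrac{1}{2}(1/\sigma^2 - 1 + \ln\sigma^2)$, so it vanishes exactly when $\sigma^2 = 1$, i.e., for mean shifts of the standard Gaussian.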

In the remainder of this section, we first present a proof of Theorem 21 and then discuss several applications of the modified log-Sobolev inequality (3.44) to derivation of Gaussian concentration inequalities via the Herbst argument.

3.2.1 An information-theoretic proof of Gross's log-Sobolev inequality



In accordance with our general theme, we will prove Theorem 21 via tensorization: We first scale up to general $n$ using suitable (sub)additivity properties, and then establish the $n = 1$ case. Indeed, suppose that (3.34) holds in dimension 1. For $n \ge 2$, let $X = (X_1, \ldots, X_n)$ be an $n$-tuple of i.i.d. $\mathcal{N}(0,1)$ variables and consider a smooth function $\phi : \mathbb{R}^n \to \mathbb{R}$, such that $\mathbb{E}_P[\phi^2(X)] = 1$, where $P = P_X = G^n$ is the product of $n$ copies of the standard Gaussian distribution $G$. If we define a probability measure $Q = Q_X$ with $dQ_X/dP_X = \phi^2$, then using Proposition 5 we can write

$$\mathbb{E}_P\left[\phi^2(X) \ln \phi^2(X)\right] = \mathbb{E}_P\left[ \frac{dQ}{dP} \ln \frac{dQ}{dP} \right] = D(Q\|P) \le \sum_{i=1}^n D\big(Q_{X_i|\bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, Q_{\bar{X}^i}\big). \tag{3.45}$$



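Following the same steps as the ones that led to (3.24), we can define for each $i = 1, \ldots, n$ and each $\bar{x}^i = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \in \mathbb{R}^{n-1}$ the function $\phi_i(\cdot|\bar{x}^i) : \mathbb{R} \to \mathbb{R}$ via

$$\phi_i(y|\bar{x}^i) \triangleq \phi(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_n), \qquad \forall\, \bar{x}^i \in \mathbb{R}^{n-1},\ y \in \mathbb{R}.$$

Then

$$\frac{dQ_{X_i|\bar{X}^i = \bar{x}^i}}{dP_{X_i}} = \frac{\phi_i^2(\cdot|\bar{x}^i)}{\mathbb{E}_P\left[\phi_i^2(X_i|\bar{x}^i)\right]}$$

for all $i \in \{1, \ldots, n\}$, $\bar{x}^i \in \mathbb{R}^{n-1}$. With this, we can write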



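$$\begin{aligned}
D\big(Q_{X_i|\bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, Q_{\bar{X}^i}\big)
&= \mathbb{E}_Q\left[ \ln \frac{dQ_{X_i|\bar{X}^i}}{dP_{X_i}} \right] \\
&= \mathbb{E}_P\left[ \frac{dQ}{dP} \ln \frac{dQ_{X_i|\bar{X}^i}}{dP_{X_i}} \right] \\
&= \mathbb{E}_P\left[ \phi^2(X) \ln \frac{\phi_i^2(X_i|\bar{X}^i)}{\mathbb{E}_P\left[\phi_i^2(X_i|\bar{X}^i) \,\middle|\, \bar{X}^i\right]} \right] \\
&= \mathbb{E}_P\left[ \phi_i^2(X_i|\bar{X}^i) \ln \frac{\phi_i^2(X_i|\bar{X}^i)}{\mathbb{E}_P\left[\phi_i^2(X_i|\bar{X}^i) \,\middle|\, \bar{X}^i\right]} \right] \\
&= \int_{\mathbb{R}^{n-1}} \mathbb{E}_P\left[ \phi_i^2(X_i|\bar{x}^i) \ln \frac{\phi_i^2(X_i|\bar{x}^i)}{\mathbb{E}_P\left[\phi_i^2(X_i|\bar{x}^i)\right]} \right] P_{\bar{X}^i}(d\bar{x}^i).
\end{aligned} \tag{3.46}$$

Since each $X_i \sim G$, we can apply the Gaussian LSI (3.34) to the univariate functions $\phi_i(\cdot|\bar{x}^i)$ to get

$$\mathbb{E}_P\left[ \phi_i^2(X_i|\bar{x}^i) \ln \frac{\phi_i^2(X_i|\bar{x}^i)}{\mathbb{E}_P\left[\phi_i^2(X_i|\bar{x}^i)\right]} \right] \le 2\, \mathbb{E}_P\left[ \big(\phi_i'(X_i|\bar{x}^i)\big)^2 \right], \qquad \forall\, i = 1, \ldots, n;\ \bar{x}^i \in \mathbb{R}^{n-1} \tag{3.47}$$

where

$$\phi_i'(y|\bar{x}^i) = \frac{d\phi_i(y|\bar{x}^i)}{dy} = \frac{\partial \phi(x)}{\partial x_i}\bigg|_{x_i = y}.$$

Since $X_1, \ldots, X_n$ are i.i.d. under $P$, we can express (3.47) as

$$\mathbb{E}_P\left[ \phi_i^2(X_i|\bar{x}^i) \ln \frac{\phi_i^2(X_i|\bar{x}^i)}{\mathbb{E}_P\left[\phi_i^2(X_i|\bar{x}^i)\right]} \right] \le 2\, \mathbb{E}_P\left[ \big(\partial_i \phi(X)\big)^2 \,\middle|\, \bar{X}^i = \bar{x}^i \right].$$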
Substituting this bound into (3.46), we have

$$D\big(Q_{X_i|\bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, Q_{\bar{X}^i}\big) \le 2\, \mathbb{E}_P\left[ \big(\partial_i \phi(X)\big)^2 \right].$$

In turn, using this to bound each term in the summation on the right-hand side of (3.45), together with the fact that $\sum_{i=1}^n (\partial_i \phi(x))^2 = \|\nabla\phi(x)\|^2$, we get

$$\mathbb{E}_P\left[\phi^2(X) \ln \phi^2(X)\right] \le 2\, \mathbb{E}_P\left[\|\nabla\phi(X)\|^2\right], \tag{3.48}$$

which is precisely the $n$-dimensional Gaussian LSI (3.35) for general $n \ge 2$, provided that it holds for $n = 1$.

Based on the above argument, we will now focus on proving the Gaussian LSI for $n = 1$. To that end, it will be convenient to express it in a different but equivalent form that relates the Fisher information and the entropy power of a real-valued random variable with a sufficiently regular density. In this form, the Gaussian LSI was first derived by Stam [45], and the equivalence between Stam's inequality and (3.34) was only noted much later by Carlen [123]. We will first establish this equivalence following Carlen's argument, and then give a new information-theoretic proof of Stam's inequality that, unlike existing proofs [128, 47], does not require de Bruijn's identity or the entropy-power inequality.

First, let us start with some definitions. Let $Y$ be a real-valued random variable with density $p_Y$. The differential entropy of $Y$ (in nats) is given by

$$h(Y) = h(p_Y) \triangleq -\int_{-\infty}^{\infty} p_Y(y) \ln p_Y(y)\, dy, \tag{3.49}$$

provided the integral exists. If it does, then the entropy power of $Y$ is given by

$$N(Y) \triangleq \frac{\exp(2h(Y))}{2\pi e}. \tag{3.50}$$

Moreover, if the density $p_Y$ is differentiable, then the Fisher information (w.r.t. a location parameter) is given by

$$J(Y) = J(p_Y) = \int_{-\infty}^{\infty} \left( \frac{d}{dy} \ln p_Y(y) \right)^2 p_Y(y)\, dy = \mathbb{E}\left[\rho_Y^2(Y)\right], \tag{3.51}$$

where $\rho_Y(y) \triangleq (d/dy) \ln p_Y(y) = p_Y'(y)/p_Y(y)$ is known as the score function.

Remark 23. In theoretical statistics, an alternative definition of the Fisher information (w.r.t. a location parameter) of a real-valued random variable $Y$ is (see [129, Definition 4.1])

$$J(Y) \triangleq \sup\left\{ \big(\mathbb{E}\,\psi'(Y)\big)^2 : \psi \in C^1,\ \mathbb{E}[\psi^2(Y)] = 1 \right\} \tag{3.52}$$

where the supremum is taken over the set of all continuously differentiable functions $\psi$ with compact support such that $\mathbb{E}[\psi^2(Y)] = 1$. Note that this definition does not involve derivatives of any functions of the density of $Y$ (nor does it assume that such a density even exists). It can be shown that the quantity defined in (3.52) exists and is finite if and only if $Y$ has an absolutely continuous density $p_Y$, in which case $J(Y)$ is equal to (3.51) (see [Ull Theorem 4.2]).

We will need the following facts: 

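1. If $D(P_Y\|G_s) < \infty$, then

$$D(P_Y\|G_s) = \frac{1}{2} \ln \frac{1}{N(Y)} + \frac{1}{2} \ln s - \frac{1}{2} + \frac{1}{2s}\, \mathbb{E}Y^2. \tag{3.53}$$

This is proved by direct calculation: Since $D(P_Y\|G_s) < \infty$, we have $P_Y \ll G_s$ and $dP_Y/dG_s = p_Y/\gamma_s$, where $\gamma_s$ denotes the density of $G_s$. Then

$$\begin{aligned}
D(P_Y\|G_s) &= \int_{-\infty}^{\infty} p_Y(y) \ln \frac{p_Y(y)}{\gamma_s(y)}\, dy \\
&= -h(Y) + \frac{1}{2} \ln(2\pi s) + \frac{1}{2s}\, \mathbb{E}Y^2 \\
&= -\frac{1}{2}\big(2h(Y) - \ln(2\pi e)\big) + \frac{1}{2} \ln s - \frac{1}{2} + \frac{1}{2s}\, \mathbb{E}Y^2 \\
&= \frac{1}{2} \ln \frac{1}{N(Y)} + \frac{1}{2} \ln s - \frac{1}{2} + \frac{1}{2s}\, \mathbb{E}Y^2,
\end{aligned}$$

which is (3.53).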
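2. If $J(Y) < \infty$ and $\mathbb{E}Y^2 < \infty$, then for any $s > 0$

$$I(P_Y\|G_s) = J(Y) + \frac{1}{s^2}\, \mathbb{E}Y^2 - \frac{2}{s} < \infty, \tag{3.54}$$

where $I(\cdot\|\cdot)$ is the relative Fisher information, cf. (3.37). Indeed:

$$\begin{aligned}
I(P_Y\|G_s) &= \int_{-\infty}^{\infty} p_Y(y) \left( \frac{d}{dy} \ln p_Y(y) - \frac{d}{dy} \ln \gamma_s(y) \right)^2 dy \\
&= \int_{-\infty}^{\infty} p_Y(y) \left( \rho_Y(y) + \frac{y}{s} \right)^2 dy \\
&= \mathbb{E}\left[\rho_Y^2(Y)\right] + \frac{2}{s}\, \mathbb{E}[Y \rho_Y(Y)] + \frac{1}{s^2}\, \mathbb{E}Y^2 \\
&= J(Y) + \frac{2}{s}\, \mathbb{E}[Y \rho_Y(Y)] + \frac{1}{s^2}\, \mathbb{E}Y^2.
\end{aligned} \tag{3.55}$$

Since $\mathbb{E}Y^2 < \infty$, we also have $\mathbb{E}|Y| < \infty$, so $\lim_{y \to \pm\infty} y\, p_Y(y) = 0$. Furthermore, integration by parts gives

$$\begin{aligned}
\mathbb{E}[Y \rho_Y(Y)] &= \int_{-\infty}^{\infty} y\, \rho_Y(y)\, p_Y(y)\, dy \\
&= \int_{-\infty}^{\infty} y\, p_Y'(y)\, dy \\
&= \left( \lim_{y \to \infty} y\, p_Y(y) - \lim_{y \to -\infty} y\, p_Y(y) \right) - \int_{-\infty}^{\infty} p_Y(y)\, dy \\
&= -1,
\end{aligned}$$

so $\mathbb{E}[Y \rho_Y(Y)] = -1$ (see [130, Lemma A1] for another proof). Substituting this into (3.55) gives (3.54).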
We are now in a position to prove the following: 

Proposition 7 (Carlen [123]). Let $Y$ be a real-valued random variable with a smooth density $p_Y$, such that $J(Y) < \infty$ and $\mathbb{E}Y^2 < \infty$. Then, the following statements are equivalent:

1. Gaussian log-Sobolev inequality, $D(P_Y\|G) \le \tfrac{1}{2}\, I(P_Y\|G)$.

2. Stam's inequality, $N(Y) J(Y) \ge 1$.
Remark 24. Carlen's original derivation in [123] requires $p_Y$ to be in the Schwartz space $\mathcal{S}(\mathbb{R})$ of infinitely differentiable functions, all of whose derivatives vanish sufficiently rapidly at infinity. In comparison, the regularity conditions of the above proposition are much weaker, requiring only that $P_Y$ has a differentiable and absolutely continuous density, as well as a finite second moment.

Proof. We first show the implication 1) $\Rightarrow$ 2). If 1) holds, then

$$D(P_Y\|G_s) \le \frac{s}{2}\, I(P_Y\|G_s), \qquad \forall\, s > 0. \tag{3.56}$$

Since $J(Y)$ and $\mathbb{E}Y^2$ are finite by assumption, the right-hand side of (3.56) is finite and can be evaluated using (3.54). Therefore, $D(P_Y\|G_s)$ is also finite, and it is given by (3.53). Hence, we can rewrite (3.56) as

$$\frac{1}{2} \ln \frac{1}{N(Y)} + \frac{1}{2} \ln s - \frac{1}{2} + \frac{1}{2s}\, \mathbb{E}Y^2 \le \frac{s}{2}\, J(Y) + \frac{1}{2s}\, \mathbb{E}Y^2 - 1.$$

Because $\mathbb{E}Y^2 < \infty$, we can cancel the corresponding term from both sides and, upon rearranging, obtain

$$\ln \frac{1}{N(Y)} \le s J(Y) - \ln s - 1.$$

Importantly, this bound holds for every $s > 0$. Therefore, using the fact that, for any $a > 0$,

$$1 + \ln a = \inf_{s > 0}\, (as - \ln s),$$

we obtain Stam's inequality $N(Y) J(Y) \ge 1$.

To establish the converse implication 2) $\Rightarrow$ 1), we simply run the above proof backwards. □
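
To make the quantities in Proposition 7 concrete, the sketch below evaluates the entropy power $N(Y)$ and the Fisher information $J(Y)$ by a plain Riemann sum on a uniform grid and reports the product $N(Y)J(Y)$. The densities used (a Gaussian and a two-component Gaussian mixture), the grid limits, and all function names are illustrative assumptions, not part of the text.

```python
import numpy as np

def gauss_pdf(y, mean, var):
    """Density of N(mean, var)."""
    return np.exp(-(y - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def gauss_dpdf(y, mean, var):
    """Derivative (in y) of the N(mean, var) density."""
    return -(y - mean) / var * gauss_pdf(y, mean, var)

def entropy_power_and_fisher(pdf, dpdf, lo=-15.0, hi=15.0, num=300001):
    """Return (N(Y), J(Y)) via a Riemann sum; assumes pdf > 0 on [lo, hi]
    and negligible mass outside that interval."""
    y = np.linspace(lo, hi, num)
    dy = y[1] - y[0]
    p, dp = pdf(y), dpdf(y)
    h = -np.sum(p * np.log(p)) * dy           # differential entropy (3.49)
    fisher = np.sum(dp ** 2 / p) * dy         # Fisher information (3.51)
    n_power = np.exp(2.0 * h) / (2.0 * np.pi * np.e)  # entropy power (3.50)
    return n_power, fisher

def mix_pdf(y):
    return 0.5 * gauss_pdf(y, -1.5, 1.0) + 0.5 * gauss_pdf(y, 1.5, 1.0)

def mix_dpdf(y):
    return 0.5 * gauss_dpdf(y, -1.5, 1.0) + 0.5 * gauss_dpdf(y, 1.5, 1.0)

if __name__ == "__main__":
    n_g, j_g = entropy_power_and_fisher(lambda y: gauss_pdf(y, 0.0, 4.0),
                                        lambda y: gauss_dpdf(y, 0.0, 4.0))
    n_m, j_m = entropy_power_and_fisher(mix_pdf, mix_dpdf)
    print("Gaussian N*J:", n_g * j_g)
    print("Mixture  N*J:", n_m * j_m)
```

For the Gaussian, $N(Y) = \sigma^2$ and $J(Y) = 1/\sigma^2$, so the product should be numerically close to 1; for the mixture, Stam's inequality predicts a product strictly larger than 1.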

We now turn to the proof of Stam's inequality. Without loss of generality, we may assume that $\mathbb{E}Y = 0$ and $\mathbb{E}Y^2 = 1$. Our proof will exploit the formula, due to Verdú [131], that expresses the divergence in terms of an integral of the excess mean squared error (MSE) in a certain estimation problem with additive Gaussian noise. Specifically, consider the problem of estimating a real-valued random variable $Y$ on the basis of a noisy observation $\sqrt{s}\,Y + Z$, where $s > 0$ is the signal-to-noise ratio (SNR) and the additive standard Gaussian noise $Z \sim G$ is independent of $Y$. If $Y$ has distribution $P$, then the minimum MSE (MMSE) at SNR $s$ is defined as

$$\mathrm{mmse}(Y, s) \triangleq \inf_{\varphi}\, \mathbb{E}\left[ \big(Y - \varphi(\sqrt{s}\,Y + Z)\big)^2 \right], \tag{3.57}$$

where the infimum is over all measurable functions (estimators) $\varphi : \mathbb{R} \to \mathbb{R}$. It is well-known that the infimum in (3.57) is achieved by the conditional expectation $u \mapsto \mathbb{E}[Y | \sqrt{s}\,Y + Z = u]$, so

$$\mathrm{mmse}(Y, s) = \mathbb{E}\left[ \big(Y - \mathbb{E}[Y | \sqrt{s}\,Y + Z]\big)^2 \right].$$

On the other hand, suppose we instead assume that $Y$ has distribution $Q$ and therefore use the mismatched estimator $u \mapsto \mathbb{E}_Q[Y | \sqrt{s}\,Y + Z = u]$, where the conditional expectation is now computed assuming that $Y \sim Q$. Then, the resulting mismatched MSE is given by

$$\mathrm{mse}_Q(Y, s) = \mathbb{E}\left[ \big(Y - \mathbb{E}_Q[Y | \sqrt{s}\,Y + Z]\big)^2 \right],$$

where the outer expectation on the right-hand side is computed using the correct distribution $P$ of $Y$. Then, the following relation holds for the divergence between $P$ and $Q$ (see [131, Theorem 1]):

$$D(P\|Q) = \frac{1}{2} \int_0^\infty \left[ \mathrm{mse}_Q(Y, s) - \mathrm{mmse}(Y, s) \right] ds. \tag{3.58}$$
We will apply the formula (3.58) to $P = P_Y$ and $Q = G$, where $P_Y$ satisfies $\mathbb{E}Y = 0$ and $\mathbb{E}Y^2 = 1$. Then it can be shown that, for any $s > 0$,

$$\mathrm{mse}_Q(Y, s) = \mathrm{mse}_G(Y, s) = \mathrm{lmmse}(Y, s),$$

where $\mathrm{lmmse}(Y, s)$ is the linear MMSE, i.e., the MMSE attainable by any affine estimator $u \mapsto au + b$, $a, b \in \mathbb{R}$:

$$\mathrm{lmmse}(Y, s) = \inf_{a, b \in \mathbb{R}}\, \mathbb{E}\left[ \big(Y - a(\sqrt{s}\,Y + Z) - b\big)^2 \right]. \tag{3.59}$$

The infimum in (3.59) is achieved by $a^* = \sqrt{s}/(1 + s)$ and $b^* = 0$, giving

$$\mathrm{lmmse}(Y, s) = \frac{1}{1 + s}. \tag{3.60}$$

Moreover, $\mathrm{mmse}(Y, s)$ can be bounded from below using the so-called van Trees inequality [132] (see also Appendix 3.A):

$$\mathrm{mmse}(Y, s) \ge \frac{1}{J(Y) + s}. \tag{3.61}$$

Then

$$\begin{aligned}
D(P_Y\|G) &= \frac{1}{2} \int_0^\infty \big( \mathrm{lmmse}(Y, s) - \mathrm{mmse}(Y, s) \big)\, ds \\
&\le \frac{1}{2} \int_0^\infty \left( \frac{1}{1 + s} - \frac{1}{J(Y) + s} \right) ds \\
&= \frac{1}{2} \lim_{\lambda \to \infty} \int_0^\lambda \left( \frac{1}{1 + s} - \frac{1}{J(Y) + s} \right) ds \\
&= \frac{1}{2} \lim_{\lambda \to \infty} \ln \left( \frac{J(Y)(1 + \lambda)}{J(Y) + \lambda} \right) \\
&= \frac{1}{2} \ln J(Y),
\end{aligned} \tag{3.62}$$

where the second step uses (3.60) and (3.61). On the other hand, using (3.53) with $s = \mathbb{E}Y^2 = 1$, we get $D(P_Y\|G) = \tfrac{1}{2} \ln(1/N(Y))$. Combining this with (3.62), we recover Stam's inequality $N(Y) J(Y) \ge 1$. Moreover, the van Trees inequality (3.61) is achieved with equality if and only if $Y$ is a standard Gaussian random variable.
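
Two computational steps in the argument above can be checked directly: the optimal affine estimator coefficient $a^* = \sqrt{s}/(1+s)$ with resulting error $1/(1+s)$, and the closed form of the integral $\tfrac{1}{2}\int_0^\lambda \big(\tfrac{1}{1+s} - \tfrac{1}{J+s}\big)\,ds$. The sketch below does both; it assumes $\mathbb{E}Y = 0$, $\mathbb{E}Y^2 = 1$ and $b = 0$, uses a simple midpoint rule, and all names are illustrative.

```python
import math

def affine_mse(a: float, s: float) -> float:
    """E[(Y - a*(sqrt(s)*Y + Z))^2] when EY = 0, EY^2 = 1, and Z ~ N(0, 1)
    is independent of Y: equals (1 - a*sqrt(s))^2 + a^2."""
    return (1.0 - a * math.sqrt(s)) ** 2 + a * a

def best_linear_coefficient(s: float) -> float:
    """Minimizer a* = sqrt(s) / (1 + s) of affine_mse(., s)."""
    return math.sqrt(s) / (1.0 + s)

def excess_integral_numeric(fisher: float, lam: float, steps: int = 200_000) -> float:
    """Midpoint-rule value of (1/2) * int_0^lam (1/(1+s) - 1/(fisher+s)) ds."""
    ds = lam / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * ds
        total += 1.0 / (1.0 + s) - 1.0 / (fisher + s)
    return 0.5 * total * ds

def excess_integral_closed_form(fisher: float, lam: float) -> float:
    """(1/2) * ln( fisher * (1 + lam) / (fisher + lam) )."""
    return 0.5 * math.log(fisher * (1.0 + lam) / (fisher + lam))

if __name__ == "__main__":
    s = 2.0
    print(affine_mse(best_linear_coefficient(s), s), 1.0 / (1.0 + s))
    print(excess_integral_numeric(2.0, 50.0), excess_integral_closed_form(2.0, 50.0))
    print(excess_integral_closed_form(2.0, 1e12), 0.5 * math.log(2.0))
```

As $\lambda$ grows, the closed form approaches $\tfrac{1}{2}\ln J(Y)$, which is the right-hand side of (3.62).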



3.2.2 From Gaussian log-Sobolev inequality to Gaussian concentration inequalities 

We are now ready to apply the log-Sobolev machinery to establish Gaussian concentration for random variables of the form $U = f(X^n)$, where $X_1, \ldots, X_n$ are i.i.d. standard normal random variables and $f : \mathbb{R}^n \to \mathbb{R}$ is any Lipschitz function. We start by considering the special case when $f$ is also differentiable.

Proposition 8. Let $X_1, \ldots, X_n$ be i.i.d. $\mathcal{N}(0,1)$ random variables. Then, for every differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ such that $\|\nabla f(X^n)\| \le 1$ almost surely, we have

$$\mathbb{P}\big( f(X^n) \ge \mathbb{E}f(X^n) + r \big) \le \exp\left( -\frac{r^2}{2} \right), \qquad \forall\, r \ge 0. \tag{3.63}$$
Proof. Let $P = G^n$ denote the distribution of $X^n$. If $Q$ is any probability measure such that $P$ and $Q$ are mutually absolutely continuous (i.e., $Q \ll P$ and $P \ll Q$), then any event that has $P$-probability 1 will also have $Q$-probability 1 and vice versa. Since the function $f$ is differentiable, it is everywhere finite, so $P^{(f)}$ and $P$ are mutually absolutely continuous. Hence, any event that occurs $P$-a.s. also occurs $P^{(tf)}$-a.s. for all $t \in \mathbb{R}$. In particular, $\|\nabla f(X^n)\| \le 1$ $P^{(tf)}$-a.s. for all $t > 0$. Therefore, applying the modified log-Sobolev inequality (3.44) to $g = tf$ for some $t > 0$, we get

$$D(P^{(tf)}\|P) \le \frac{t^2}{2}\, \mathbb{E}_P^{(tf)}\left[ \|\nabla f(X^n)\|^2 \right] \le \frac{t^2}{2}. \tag{3.64}$$

Using Corollary 7 with $U = f(X^n) - \mathbb{E}f(X^n)$, we get (3.63). □



Remark 25. Corollary 7 and inequality (3.44) with $g = tf$ imply that, for any smooth function $f$ with $\|\nabla f(X^n)\|^2 \le L$ a.s.,

$$\mathbb{P}\big( f(X^n) \ge \mathbb{E}f(X^n) + r \big) \le \exp\left( -\frac{r^2}{2L} \right), \qquad \forall\, r \ge 0. \tag{3.65}$$

Thus, the constant $\kappa$ in the corresponding Gaussian concentration bound (3.2) is controlled by the sensitivity of $f$ to modifications of its coordinates.

Having established concentration for smooth /, we can now proceed to the general case: 

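Theorem 22. Let $X^n$ be as before, and let $f : \mathbb{R}^n \to \mathbb{R}$ be a 1-Lipschitz function, i.e.,

$$|f(x^n) - f(y^n)| \le \|x^n - y^n\|, \qquad \forall\, x^n, y^n \in \mathbb{R}^n.$$

Then

$$\mathbb{P}\big( f(X^n) \ge \mathbb{E}f(X^n) + r \big) \le \exp\left( -\frac{r^2}{2} \right), \qquad \forall\, r \ge 0. \tag{3.66}$$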



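Proof. The trick is to slightly perturb $f$ to get a differentiable function with the norm of its gradient bounded by the Lipschitz constant of $f$. Then we can apply Proposition 8 and consider the limit of vanishing perturbation.

We construct the perturbation as follows. Let $Z_1, \ldots, Z_n$ be $n$ i.i.d. $\mathcal{N}(0,1)$ random variables, independent of $X^n$. For any $\delta > 0$, define the function

$$\begin{aligned}
f_\delta(x^n) &\triangleq \mathbb{E}\left[ f\big(x^n + \sqrt{\delta}\, Z^n\big) \right] \\
&= \frac{1}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} f\big(x^n + \sqrt{\delta}\, z^n\big) \exp\left( -\frac{\|z^n\|^2}{2} \right) dz^n \\
&= \frac{1}{(2\pi\delta)^{n/2}} \int_{\mathbb{R}^n} f(z^n) \exp\left( -\frac{\|z^n - x^n\|^2}{2\delta} \right) dz^n.
\end{aligned}$$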



It is easy to see that $f_\delta$ is differentiable (in fact, it is in $C^\infty$; this is known as the smoothing property of the Gaussian convolution kernel). Moreover, using Jensen's inequality and the fact that $f$ is 1-Lipschitz,

$$\begin{aligned}
|f_\delta(x^n) - f(x^n)| &= \left| \mathbb{E}\big[ f(x^n + \sqrt{\delta}\, Z^n) \big] - f(x^n) \right| \\
&\le \mathbb{E}\left| f(x^n + \sqrt{\delta}\, Z^n) - f(x^n) \right| \\
&\le \sqrt{\delta}\; \mathbb{E}\|Z^n\|.
\end{aligned}$$

Therefore, $\lim_{\delta \to 0} f_\delta(x^n) = f(x^n)$ for every $x^n \in \mathbb{R}^n$. Moreover, because $f$ is 1-Lipschitz, it is differentiable almost everywhere by Rademacher's theorem [133, Section 3.1.2], and $\|\nabla f\| \le 1$ almost everywhere. Consequently, since $\nabla f_\delta(x^n) = \mathbb{E}[\nabla f(x^n + \sqrt{\delta}\, Z^n)]$, Jensen's inequality gives

$$\|\nabla f_\delta(x^n)\| \le \mathbb{E}\left\| \nabla f\big(x^n + \sqrt{\delta}\, Z^n\big) \right\| \le 1$$
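for every $x^n \in \mathbb{R}^n$. Therefore, we can apply Proposition 8 to get, for all $\delta > 0$ and $r > 0$,

$$\mathbb{P}\big( f_\delta(X^n) \ge \mathbb{E}f_\delta(X^n) + r \big) \le \exp\left( -\frac{r^2}{2} \right).$$

Using the fact that $f_\delta(x^n)$ converges to $f(x^n)$ everywhere as $\delta \to 0$, we obtain (3.66):

$$\begin{aligned}
\mathbb{P}\big( f(X^n) \ge \mathbb{E}f(X^n) + r \big) &= \mathbb{E}\left[ 1_{\{f(X^n) \ge \mathbb{E}f(X^n) + r\}} \right] \\
&\le \lim_{\delta \to 0} \mathbb{E}\left[ 1_{\{f_\delta(X^n) \ge \mathbb{E}f_\delta(X^n) + r\}} \right] \\
&= \lim_{\delta \to 0} \mathbb{P}\big( f_\delta(X^n) \ge \mathbb{E}f_\delta(X^n) + r \big) \\
&\le \exp\left( -\frac{r^2}{2} \right),
\end{aligned}$$

where the first inequality is by Fatou's lemma. □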
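
The bound (3.66) can be illustrated by simulation. The following Monte Carlo sketch compares the empirical upper tail of two 1-Lipschitz functions of a standard Gaussian vector (the Euclidean norm and the coordinate maximum) with $\exp(-r^2/2)$. The choice of functions, dimension, sample size, and seed are illustrative assumptions, and the unknown mean $\mathbb{E}f(X^n)$ is replaced by the sample mean.

```python
import numpy as np

def euclidean_norm(x):
    """f(x) = ||x||, 1-Lipschitz w.r.t. the Euclidean norm (rows are samples)."""
    return np.linalg.norm(x, axis=1)

def coordinate_max(x):
    """f(x) = max_i x_i, also 1-Lipschitz w.r.t. the Euclidean norm."""
    return x.max(axis=1)

def empirical_tail(f, n, r_values, num_samples=100_000, seed=0):
    """Estimate P(f(X^n) >= E f(X^n) + r) for each r, X^n ~ N(0, I_n).
    The expectation E f(X^n) is approximated by the sample mean."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_samples, n))
    centered = f(x)
    centered = centered - centered.mean()
    return np.array([(centered >= r).mean() for r in r_values])

if __name__ == "__main__":
    r_values = np.array([0.5, 1.0, 1.5, 2.0])
    bound = np.exp(-r_values ** 2 / 2.0)
    print("bound        :", bound)
    print("norm tail    :", empirical_tail(euclidean_norm, 20, r_values))
    print("max-coord tail:", empirical_tail(coordinate_max, 20, r_values))
```

The bound is dimension-free, so the same comparison can be repeated for other values of `n` without changing the right-hand side.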



3.2.3 Hypercontractivity, Gaussian log-Sobolev inequality, and Rényi divergence

We close our treatment of the Gaussian log-Sobolev inequality with a striking result, proved by Gross in his original paper [44], that this inequality is equivalent to a very strong contraction property (dubbed hypercontractivity) of a certain class of stochastic transformations. The original motivation behind the work of Gross [44] came from problems in quantum field theory. However, we will take an information-theoretic point of view and relate it to data processing inequalities for a certain class of channels with additive Gaussian noise, as well as to the rate of convergence in the second law of thermodynamics for Markov processes [134].

Consider a pair $(X, Y)$ of real-valued random variables that are related through the stochastic transformation

$$Y = e^{-t} X + \sqrt{1 - e^{-2t}}\, Z \tag{3.67}$$

for some $t \ge 0$, where the additive noise $Z \sim G$ is independent of $X$. For reasons that will become clear shortly, we will refer to the channel that implements the transformation (3.67) for a given $t \ge 0$ as the Ornstein-Uhlenbeck channel with noise parameter $t$ and denote it by $\mathrm{OU}(t)$. Similarly, we will refer to the collection of channels $\{\mathrm{OU}(t)\}_{t=0}^\infty$ indexed by all $t \ge 0$ as the Ornstein-Uhlenbeck channel family. We immediately note the following properties:

1. $\mathrm{OU}(0)$ is the ideal channel, $Y = X$.

2. If $X \sim G$, then $Y \sim G$ as well, for any $t$.

3. Using the terminology of [13., Chapter 4], the channel family $\{\mathrm{OU}(t)\}_{t=0}^\infty$ is ordered by degradation: for any $t_1, t_2 \ge 0$ we have

$$\mathrm{OU}(t_1 + t_2) = \mathrm{OU}(t_2) \circ \mathrm{OU}(t_1) = \mathrm{OU}(t_1) \circ \mathrm{OU}(t_2), \tag{3.68}$$

which is shorthand for the following statement: for any input random variable $X$, any standard Gaussian $Z$ independent of $X$, and any $t_1, t_2 \ge 0$, we can always find independent standard Gaussian random variables $Z_1, Z_2$ that are also independent of $X$, such that

$$\begin{aligned}
e^{-(t_1 + t_2)} X + \sqrt{1 - e^{-2(t_1 + t_2)}}\, Z
&\stackrel{d}{=} e^{-t_2} \left[ e^{-t_1} X + \sqrt{1 - e^{-2t_1}}\, Z_1 \right] + \sqrt{1 - e^{-2t_2}}\, Z_2 \\
&\stackrel{d}{=} e^{-t_1} \left[ e^{-t_2} X + \sqrt{1 - e^{-2t_2}}\, Z_1 \right] + \sqrt{1 - e^{-2t_1}}\, Z_2
\end{aligned} \tag{3.69}$$
where $\stackrel{d}{=}$ denotes equality of distributions. In other words, we can always define real-valued random variables $X, Y_1, Y_2, Z_1, Z_2$ on a common probability space $(\Omega, \mathcal{F}, \mathbb{P})$, such that $Z_1, Z_2 \sim G$, $(X, Z_1, Z_2)$ are mutually independent,

$$\begin{aligned}
Y_1 &\stackrel{d}{=} e^{-t_1} X + \sqrt{1 - e^{-2t_1}}\, Z_1 \\
Y_2 &\stackrel{d}{=} e^{-(t_1 + t_2)} X + \sqrt{1 - e^{-2(t_1 + t_2)}}\, Z_2
\end{aligned}$$

and $X \to Y_1 \to Y_2$ is a Markov chain. Even more generally, given any real-valued random variable $X$, we can construct a continuous-time Markov process $\{Y_t\}_{t=0}^\infty$ with $Y_0 \stackrel{d}{=} X$ and $Y_t \stackrel{d}{=} e^{-t} X + \sqrt{1 - e^{-2t}}\, \mathcal{N}(0,1)$ for all $t \ge 0$. One way to do this is to let $\{Y_t\}_{t=0}^\infty$ be governed by the Itô stochastic differential equation (SDE)

$$dY_t = -Y_t\, dt + \sqrt{2}\, dB_t, \qquad t \ge 0 \tag{3.70}$$

with the initial condition $Y_0 \stackrel{d}{=} X$, where $\{B_t\}$ denotes the standard one-dimensional Wiener process (a.k.a. Brownian motion). The SDE (3.70) is known as the Langevin equation \T65\ p. 75], and the random process $\{Y_t\}$ that solves it is called the Ornstein-Uhlenbeck process; the solution of (3.70) is given by (see, e.g., [EH p. 358] or [H3 p. 127])

$$Y_t = X e^{-t} + \sqrt{2} \int_0^t e^{-(t - s)}\, dB_s, \qquad t \ge 0$$

where, by the Itô isometry, the variance of the (zero-mean) additive Gaussian noise is indeed

$$\mathbb{E}\left[ \left( \sqrt{2} \int_0^t e^{-(t - s)}\, dB_s \right)^2 \right] = 2 \int_0^t e^{-2(t - s)}\, ds = 2 e^{-2t} \int_0^t e^{2s}\, ds = 1 - e^{-2t}, \qquad \forall\, t \ge 0.$$

This explains our choice of the name "Ornstein-Uhlenbeck channel" for the random transformation (3.67).
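
The degradation ordering (3.68) reduces to an identity between gains and noise variances, which the following minimal sketch verifies. Each $\mathrm{OU}(t)$ channel is represented here by the pair (gain, noise variance) $= (e^{-t},\, 1 - e^{-2t})$; this representation and the function names are illustrative assumptions.

```python
import math

def ou_channel(t: float):
    """Return (gain, noise_variance) of OU(t): Y = gain * X + sqrt(noise_variance) * Z."""
    return math.exp(-t), 1.0 - math.exp(-2.0 * t)

def compose(first, second):
    """Cascade two additive-Gaussian-noise channels with independent noises:
    the input passes through `first`, then through `second`."""
    g1, v1 = first
    g2, v2 = second
    # second(first(X)) = g2*g1*X + g2*sqrt(v1)*Z1 + sqrt(v2)*Z2
    return g2 * g1, g2 * g2 * v1 + v2

if __name__ == "__main__":
    t1, t2 = 0.3, 1.1
    print(compose(ou_channel(t1), ou_channel(t2)))
    print(compose(ou_channel(t2), ou_channel(t1)))
    print(ou_channel(t1 + t2))
```

Since the cascade of two such channels is again a linear channel with independent Gaussian noise, matching gain and variance is enough to match the output distribution for every input $X$.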

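In order to state the main result to be proved in this section, we need the following definition: the Rényi divergence of order $\alpha \in \mathbb{R}^+ \setminus \{0, 1\}$ between two probability measures, $P$ and $Q$, is defined as

$$D_\alpha(P\|Q) \triangleq \begin{cases} \dfrac{1}{\alpha - 1} \ln \mathbb{E}_Q\left[ \left( \dfrac{dP}{dQ} \right)^\alpha \right], & \text{if } P \ll Q \\[2mm] +\infty, & \text{otherwise.} \end{cases} \tag{3.71}$$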



We recall several key properties of the Rényi divergence (see, for example, [138]):

1. The Kullback-Leibler divergence $D(P\|Q)$ is the limit of $D_\alpha(P\|Q)$ as $\alpha$ tends to 1 from below:

$$D(P\|Q) = \lim_{\alpha \uparrow 1} D_\alpha(P\|Q)$$

and

$$D(P\|Q) = \sup_{0 < \alpha < 1} D_\alpha(P\|Q) \le \inf_{\alpha > 1} D_\alpha(P\|Q).$$

Moreover, if $D(P\|Q) = \infty$ or there exists some $\beta > 1$ such that $D_\beta(P\|Q) < \infty$, then also

$$D(P\|Q) = \lim_{\alpha \downarrow 1} D_\alpha(P\|Q). \tag{3.72}$$

2. If we define $D_1(P\|Q)$ as $D(P\|Q)$, then the function $\alpha \mapsto D_\alpha(P\|Q)$ is nondecreasing.

3. For all $\alpha > 0$, $D_\alpha(\cdot\|\cdot)$ satisfies the data processing inequality: if we have two possible distributions $P$ and $Q$ for a random variable $U$, then for any channel (stochastic transformation) $T$ that takes $U$ as input we have

$$D_\alpha(\tilde{P}\|\tilde{Q}) \le D_\alpha(P\|Q), \qquad \forall\, \alpha > 0 \tag{3.73}$$

where $\tilde{P}$ or $\tilde{Q}$ is the distribution of the output of $T$ when the input has distribution $P$ or $Q$, respectively.

4. The Rényi divergence is non-negative for any order $\alpha > 0$.
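
The properties listed above can be checked numerically on finite alphabets, where the expectation in (3.71) is a finite sum. The sketch below is restricted to discrete distributions (an assumption made purely for illustration); the distributions `p`, `q`, the channel matrix `w`, and the function names are hypothetical.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(P || Q) in nats for finite distributions, alpha > 0, alpha != 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):      # P is not absolutely continuous w.r.t. Q
        return np.inf
    mask = q > 0
    ratio = p[mask] / q[mask]           # dP/dQ on the support of Q
    return np.log(np.sum(q[mask] * ratio ** alpha)) / (alpha - 1.0)

def kl_divergence(p, q):
    """D(P || Q) in nats for finite distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

if __name__ == "__main__":
    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.2, 0.5, 0.3])
    w = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])  # rows: channel transition probabilities
    for alpha in [0.25, 0.5, 0.9, 1.1, 2.0, 5.0]:
        print(alpha, renyi_divergence(p, q, alpha), renyi_divergence(p @ w, q @ w, alpha))
    print("KL:", kl_divergence(p, q))
```

Each printed row shows the divergence before and after the channel; by property 3 the second number should not exceed the first, and by property 2 the first column of values should be nondecreasing in $\alpha$.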

Now consider the following set-up. Let $X$ be a real-valued random variable with a sufficiently well-behaved distribution $P$ (at the very least, we assume $P \ll G$). For any $t \ge 0$, let $P_t$ denote the output distribution of the $\mathrm{OU}(t)$ channel with input $X \sim P$. Then, using the fact that the standard Gaussian distribution $G$ is left invariant by the Ornstein-Uhlenbeck channel family together with the data processing inequality (3.73), we have

$$D_\alpha(P_t\|G) \le D_\alpha(P\|G), \qquad \forall\, t \ge 0,\ \alpha > 0. \tag{3.74}$$

In other words, as we increase the noise parameter $t$, the output distribution $P_t$ starts to resemble the invariant distribution $G$ more and more, where the measure of resemblance is given by any of the Rényi divergences. This is, of course, nothing but the second law of thermodynamics for Markov chains (see, e.g., [96, Section 4.4] or [134]) applied to the continuous-time Markov process governed by the Langevin equation (3.70). We will now show, however, that the Gaussian log-Sobolev inequality of Gross (see Theorem 21) implies a stronger statement: For any $\alpha > 1$ and any $\varepsilon \in (0, 1)$, there exists a positive constant $\tau = \tau(\alpha, \varepsilon)$, such that

$$D_\alpha(P_t\|G) \le \varepsilon\, D_\alpha(P\|G), \qquad \forall\, t \ge \tau. \tag{3.75}$$



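Here is the precise result:

Theorem 23 (Hypercontractive estimate for the Ornstein-Uhlenbeck channel). The Gaussian log-Sobolev inequality of Theorem 21 is equivalent to the following statement: For any $1 < \beta < \alpha < \infty$

$$D_\alpha(P_t\|G) \le \frac{\alpha(\beta - 1)}{\beta(\alpha - 1)}\, D_\beta(P\|G), \qquad \forall\, t \ge \frac{1}{2} \ln\left( \frac{\alpha - 1}{\beta - 1} \right). \tag{3.76}$$

Remark 26. To see that Theorem 23 implies (3.75), fix $\alpha > 1$ and $\varepsilon \in (0, 1)$. Let

$$\beta = \beta(\varepsilon, \alpha) \triangleq \frac{\alpha}{\alpha - \varepsilon(\alpha - 1)}.$$

It is easy to verify that $1 < \beta < \alpha$ and $\frac{\alpha(\beta - 1)}{\beta(\alpha - 1)} = \varepsilon$. Hence, Theorem 23 implies that

$$D_\alpha(P_t\|G) \le \varepsilon\, D_\beta(P\|G), \qquad \forall\, t \ge \frac{1}{2} \ln\left( 1 + \frac{\alpha(1 - \varepsilon)}{\varepsilon} \right) \triangleq \tau(\alpha, \varepsilon).$$

Since the Rényi divergence $D_\alpha(\cdot\|\cdot)$ is monotonic non-decreasing in the parameter $\alpha$, and $1 < \beta < \alpha$, it follows that $D_\beta(P\|G) \le D_\alpha(P\|G)$. It therefore follows from the last inequality that

$$D_\alpha(P_t\|G) \le \varepsilon\, D_\alpha(P\|G), \qquad \forall\, t \ge \tau(\alpha, \varepsilon).$$

We now turn to the proof of Theorem 23.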
Proof. As a reminder, the $L^p$ norm of a real-valued random variable $U$ is defined by $\|U\|_p \triangleq (\mathbb{E}[|U|^p])^{1/p}$ for $p \ge 1$. It will be convenient to work with the following equivalent form of the Rényi divergence in (3.71): For any two random variables $U$ and $V$ such that $P_U \ll P_V$, we have

$$D_\alpha(P_U\|P_V) = \frac{\alpha}{\alpha - 1} \ln \left\| \frac{dP_U}{dP_V}(V) \right\|_\alpha, \qquad \alpha > 1. \tag{3.77}$$

Let us denote by $g$ the Radon-Nikodym derivative $dP/dG$. It is easy to show that $P_t \ll G$ for all $t$, so the Radon-Nikodym derivative $g_t \triangleq dP_t/dG$ exists. Moreover, $g_0 = g$. Also, let us define the function $\alpha : [0, \infty) \to [\beta, \infty)$ by $\alpha(t) = 1 + (\beta - 1)e^{2t}$ for some $\beta > 1$. Let $Z \sim G$. Using (3.77), it is easy to verify that the desired bound (3.76) is equivalent to the statement that the function $F : [0, \infty) \to \mathbb{R}$, defined by

$$F(t) \triangleq \ln \left\| \frac{dP_t}{dG}(Z) \right\|_{\alpha(t)} \equiv \ln \| g_t(Z) \|_{\alpha(t)},$$

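is non-increasing. From now on, we will adhere to the following notational convention: we will use either the dot or $d/dt$ to denote derivatives w.r.t. the "time" $t$, and the prime to denote derivatives w.r.t. the "space" variable $z$. We start by computing the derivative of $F$ w.r.t. $t$, which gives

$$\begin{aligned}
\dot{F}(t) &= \frac{d}{dt} \left\{ \frac{1}{\alpha(t)} \ln \mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \right] \right\} \\
&= -\frac{\dot{\alpha}(t)}{\alpha^2(t)} \ln \mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \right] + \frac{1}{\alpha(t)} \cdot \frac{\frac{d}{dt} \mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \right]}{\mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \right]}.
\end{aligned} \tag{3.78}$$

To handle the derivative w.r.t. $t$ in the second term in (3.78), we need to delve a bit into the theory of the so-called Ornstein-Uhlenbeck semigroup, which is an alternative representation of the Ornstein-Uhlenbeck channel (3.67).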

For any $t \ge 0$, let us define a linear operator $K_t$ acting on any sufficiently regular (e.g., $L^1(G)$) function $h$ as

$$K_t h(x) \triangleq \mathbb{E}\left[ h\left( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \right) \right], \tag{3.79}$$

where $Z \sim G$, as before. The family of operators $\{K_t\}_{t=0}^\infty$ has the following properties:

1. $K_0$ is the identity operator, $K_0 h = h$ for any $h$.

2. For any $t \ge 0$, if we consider the $\mathrm{OU}(t)$ channel, given by the random transformation (3.67), then for any measurable function $F$ such that $\mathbb{E}|F(Y)| < \infty$ with $Y$ in (3.67), we can write

$$K_t F(x) = \mathbb{E}[F(Y) | X = x], \qquad \forall\, x \in \mathbb{R} \tag{3.80}$$

and

$$\mathbb{E}[F(Y)] = \mathbb{E}[K_t F(X)]. \tag{3.81}$$

Here, (3.80) easily follows from (3.67), and (3.81) is immediate from (3.80).

3. A particularly useful special case of the above is as follows. Let $X$ have distribution $P$ with $P \ll G$, and let $P_t$ denote the output distribution of the $\mathrm{OU}(t)$ channel. Then, as we have seen before, $P_t \ll G$, and the corresponding densities satisfy

$$g_t(x) = K_t g(x). \tag{3.82}$$
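To prove (3.82), we can either use (3.80) and the fact that $g_t(x) = \mathbb{E}[g(Y) | X = x]$, or proceed directly from (3.67):

$$\begin{aligned}
g_t(x) &= \frac{1}{\sqrt{2\pi(1 - e^{-2t})}} \int_{\mathbb{R}} \exp\left( -\frac{(u - e^{-t} x)^2}{2(1 - e^{-2t})} \right) g(u)\, du \\
&= \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} g\left( e^{-t} x + \sqrt{1 - e^{-2t}}\, z \right) \exp\left( -\frac{z^2}{2} \right) dz \\
&\equiv \mathbb{E}\left[ g\left( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \right) \right]
\end{aligned} \tag{3.83}$$

where in the second line we have made the change of variables $z = \dfrac{u - e^{-t} x}{\sqrt{1 - e^{-2t}}}$, and in the third line $Z \sim G$.

4. The family of operators $\{K_t\}_{t=0}^\infty$ forms a semigroup, i.e., for any $t_1, t_2 \ge 0$ we have

$$K_{t_1 + t_2} = K_{t_1} \circ K_{t_2} = K_{t_2} \circ K_{t_1},$$

which is shorthand for saying that $K_{t_1 + t_2} h = K_{t_2}(K_{t_1} h) = K_{t_1}(K_{t_2} h)$ for any sufficiently regular $h$. This follows from (3.80) and (3.81) and from the fact that the channel family $\{\mathrm{OU}(t)\}_{t=0}^\infty$ is ordered by degradation. For this reason, $\{K_t\}_{t=0}^\infty$ is referred to as the Ornstein-Uhlenbeck semigroup. In particular, if $\{Y_t\}_{t=0}^\infty$ is the Ornstein-Uhlenbeck process, then for any sufficiently regular function $F : \mathbb{R} \to \mathbb{R}$ we have

$$K_t F(x) = \mathbb{E}[F(Y_t) | Y_0 = x], \qquad \forall\, x \in \mathbb{R}.$$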

Two deeper results concerning the Ornstein-Uhlenbeck semigroup, which we will need, are as follows: Define the second-order differential operator $\mathcal{L}$ by

$$\mathcal{L}h(x) \triangleq h''(x) - x h'(x)$$

for all sufficiently smooth functions $h : \mathbb{R} \to \mathbb{R}$. Then:

1. The Ornstein-Uhlenbeck flow $\{h_t\}_{t=0}^\infty$, where $h_t = K_t h$ with sufficiently smooth initial condition $h_0 = h$, satisfies the partial differential equation (PDE)

$$\dot{h}_t = \mathcal{L} h_t. \tag{3.84}$$

2. For $Z \sim G$ and all sufficiently smooth functions $g, h : \mathbb{R} \to \mathbb{R}$ we have the integration-by-parts formula

$$\mathbb{E}[g(Z)\, \mathcal{L}h(Z)] = \mathbb{E}[h(Z)\, \mathcal{L}g(Z)] = -\mathbb{E}[g'(Z)\, h'(Z)]. \tag{3.85}$$

We provide the details in Appendix 3.B.
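
For polynomial test functions, the operator $K_t$ in (3.79) can be evaluated exactly with Gauss-Hermite quadrature, which gives a quick check of the semigroup property and of the PDE $\dot{h}_t = \mathcal{L}h_t$ on an eigenfunction. The sketch below uses NumPy's `hermegauss` (nodes and weights for the weight $e^{-x^2/2}$); the test functions and the helper name `ou_operator` are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def ou_operator(h, t, deg=40):
    """Return the function K_t h, with K_t h(x) = E[h(e^{-t} x + sqrt(1 - e^{-2t}) Z)],
    Z ~ N(0, 1). The expectation is computed by Gauss-Hermite quadrature, which is
    exact when h is a polynomial of degree at most 2*deg - 1. The function h must
    accept NumPy arrays and act elementwise."""
    nodes, weights = hermegauss(deg)
    weights = weights / np.sqrt(2.0 * np.pi)   # normalize to the N(0, 1) density
    gain = np.exp(-t)
    scale = np.sqrt(1.0 - np.exp(-2.0 * t))

    def kth(x):
        x = np.asarray(x, dtype=float)
        # Append a quadrature axis, evaluate h there, and integrate it out.
        return h(gain * x[..., None] + scale * nodes) @ weights

    return kth

if __name__ == "__main__":
    x = np.linspace(-2.0, 2.0, 5)
    h = lambda u: u ** 3 + u ** 2
    lhs = ou_operator(ou_operator(h, 0.3), 0.5)(x)   # K_{0.5}(K_{0.3} h)
    rhs = ou_operator(h, 0.8)(x)                     # K_{0.8} h
    print(np.max(np.abs(lhs - rhs)))

    he2 = lambda u: u ** 2 - 1.0                     # satisfies L he2 = -2 he2
    print(np.max(np.abs(ou_operator(he2, 0.7)(x) - np.exp(-1.4) * he2(x))))
```

The second check uses the fact that $h(x) = x^2 - 1$ satisfies $\mathcal{L}h = -2h$, so the flow solving (3.84) is $h_t = e^{-2t}h$, which should coincide with $K_t h$.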

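We are now ready to tackle the second term in (3.78). Noting that the family of densities $\{g_t\}_{t=0}^\infty$ forms an Ornstein-Uhlenbeck flow with initial condition $g_0 = g$, we have (assuming enough regularity conditions to permit interchanges of derivatives and expectations)

$$\begin{aligned}
\frac{d}{dt} \mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \right]
&= \dot{\alpha}(t)\, \mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \ln g_t(Z) \right] + \alpha(t)\, \mathbb{E}\left[ (g_t(Z))^{\alpha(t) - 1}\, \frac{d}{dt} g_t(Z) \right] \\
&= \dot{\alpha}(t)\, \mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \ln g_t(Z) \right] + \alpha(t)\, \mathbb{E}\left[ (g_t(Z))^{\alpha(t) - 1}\, \mathcal{L} g_t(Z) \right] & (3.86) \\
&= \dot{\alpha}(t)\, \mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \ln g_t(Z) \right] - \alpha(t)\, \mathbb{E}\left[ \left( (g_t(Z))^{\alpha(t) - 1} \right)' g_t'(Z) \right] & (3.87) \\
&= \dot{\alpha}(t)\, \mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \ln g_t(Z) \right] - \alpha(t)\big(\alpha(t) - 1\big)\, \mathbb{E}\left[ (g_t(Z))^{\alpha(t) - 2} \big( g_t'(Z) \big)^2 \right] & (3.88)
\end{aligned}$$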
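where we use (3.84) to get (3.86), and (3.85) to get (3.87). If we define the function $\phi_t(z) \triangleq (g_t(z))^{\alpha(t)/2}$, then we can rewrite (3.88) as

$$\frac{d}{dt} \mathbb{E}\left[ (g_t(Z))^{\alpha(t)} \right] = \frac{\dot{\alpha}(t)}{\alpha(t)}\, \mathbb{E}\left[ \phi_t^2(Z) \ln \phi_t^2(Z) \right] - \frac{4(\alpha(t) - 1)}{\alpha(t)}\, \mathbb{E}\left[ \big( \phi_t'(Z) \big)^2 \right]. \tag{3.89}$$

Using the definition of $\phi_t$ and substituting (3.89) into the right-hand side of (3.78) gives

$$\alpha^2(t)\, \mathbb{E}[\phi_t^2(Z)]\, \dot{F}(t) = \dot{\alpha}(t) \left( \mathbb{E}[\phi_t^2(Z) \ln \phi_t^2(Z)] - \mathbb{E}[\phi_t^2(Z)] \ln \mathbb{E}[\phi_t^2(Z)] \right) - 4(\alpha(t) - 1)\, \mathbb{E}\left[ \big( \phi_t'(Z) \big)^2 \right]. \tag{3.90}$$

If we now apply the Gaussian log-Sobolev inequality (3.34) to $\phi_t$, then from (3.90) we get

$$\alpha^2(t)\, \mathbb{E}[\phi_t^2(Z)]\, \dot{F}(t) \le 2\big( \dot{\alpha}(t) - 2(\alpha(t) - 1) \big)\, \mathbb{E}\left[ \big( \phi_t'(Z) \big)^2 \right]. \tag{3.91}$$

Since $\alpha(t) = 1 + (\beta - 1)e^{2t}$, we have $\dot{\alpha}(t) - 2(\alpha(t) - 1) = 0$, and the right-hand side of (3.91) is equal to zero. Moreover, because $\alpha(t) > 0$ and $\phi_t^2(Z) > 0$ a.s. (note that $\phi_t^2 > 0$ if and only if $g_t > 0$, but the latter follows from (3.83) where $g$ is a probability density function), we conclude that $\dot{F}(t) \le 0$.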
What we have proved so far is that, for any $\beta > 1$ and any $t \ge 0$,

$$D_{\alpha(t)}(P_t\|G) \le \frac{\alpha(t)(\beta - 1)}{\beta(\alpha(t) - 1)}\, D_\beta(P\|G) \tag{3.92}$$

where $\alpha(t) = 1 + (\beta - 1)e^{2t}$. By the monotonicity property of the Rényi divergence, the left-hand side of (3.92) is greater than or equal to $D_\alpha(P_t\|G)$ as soon as $\alpha \le \alpha(t)$. By the same token, because the function $u \in (1, \infty) \mapsto \frac{u}{u - 1}$ is strictly decreasing, the right-hand side of (3.92) can be upper-bounded by $\frac{\alpha(\beta - 1)}{\beta(\alpha - 1)}\, D_\beta(P\|G)$ for all $\alpha \le \alpha(t)$. Putting all these facts together, we conclude that the Gaussian log-Sobolev inequality (3.34) implies (3.76).

We now show that (3.76) implies the log-Sobolev inequality of Theorem 21. To that end, we recall that (3.76) is equivalent to the right-hand side of (3.90) being less than or equal to zero for all $t \ge 0$ and all $\beta > 1$. Let us choose $t = 0$ and $\beta = 2$, in which case

$$\alpha(0) = \dot{\alpha}(0) = 2, \qquad \phi_0 = g.$$

Using this in (3.90) for $t = 0$, we get

$$2\left( \mathbb{E}\left[ g^2(Z) \ln g^2(Z) \right] - \mathbb{E}[g^2(Z)] \ln \mathbb{E}[g^2(Z)] \right) - 4\, \mathbb{E}\left[ \big( g'(Z) \big)^2 \right] \le 0$$

which is precisely the log-Sobolev inequality (3.34), where $\mathbb{E}[g(Z)] = \mathbb{E}_G\left[ \frac{dP}{dG} \right] = 1$. □



As a consequence, we can establish a strong version of the data processing inequality for the ordinary divergence:

Corollary 9. In the notation of Theorem 23, we have for any $t \ge 0$

$$D(P_t\|G) \le e^{-2t}\, D(P\|G). \tag{3.93}$$

Proof. Let $\alpha = 1 + \varepsilon e^{2t}$ and $\beta = 1 + \varepsilon$ for some $\varepsilon > 0$. Then using Theorem 23, we have

$$D_{1 + \varepsilon e^{2t}}(P_t\|G) \le \left( \frac{e^{-2t} + \varepsilon}{1 + \varepsilon} \right) D_{1 + \varepsilon}(P\|G), \qquad \forall\, t \ge 0. \tag{3.94}$$

Taking the limit of both sides of (3.94) as $\varepsilon \downarrow 0$ and using (3.72) (note that $D_\alpha(P\|G) < \infty$ for $\alpha > 1$), we get (3.93). □
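
A simple family on which the hypercontractive estimate of Theorem 23, $D_\alpha(P_t\|G) \le \frac{\alpha(\beta-1)}{\beta(\alpha-1)} D_\beta(P\|G)$ for $t \ge \tfrac{1}{2}\ln\frac{\alpha-1}{\beta-1}$, and Corollary 9 can be evaluated in closed form is $P = \mathcal{N}(m, 1)$. The sketch below relies on two facts that are assumptions of this example rather than statements from the text: the $\mathrm{OU}(t)$ output is then $P_t = \mathcal{N}(m e^{-t}, 1)$, and $D_\alpha(\mathcal{N}(\mu, 1)\|\mathcal{N}(0, 1)) = \alpha\mu^2/2$ (with $\alpha = 1$ giving the Kullback-Leibler divergence). Function names are illustrative.

```python
import math

def renyi_shifted_gaussian(mean: float, alpha: float) -> float:
    """D_alpha(N(mean, 1) || N(0, 1)) = alpha * mean^2 / 2 (alpha = 1: KL divergence)."""
    return 0.5 * alpha * mean * mean

def ou_output_mean(mean: float, t: float) -> float:
    """Mean of the OU(t) output when the input is N(mean, 1); the variance stays 1."""
    return mean * math.exp(-t)

def time_threshold(alpha: float, beta: float) -> float:
    """Smallest t covered by the estimate: (1/2) ln((alpha - 1)/(beta - 1))."""
    return 0.5 * math.log((alpha - 1.0) / (beta - 1.0))

def hypercontractive_sides(mean: float, alpha: float, beta: float, t: float):
    """Return (D_alpha(P_t || G), alpha(beta-1)/(beta(alpha-1)) * D_beta(P || G))."""
    lhs = renyi_shifted_gaussian(ou_output_mean(mean, t), alpha)
    rhs = alpha * (beta - 1.0) / (beta * (alpha - 1.0)) * renyi_shifted_gaussian(mean, beta)
    return lhs, rhs

if __name__ == "__main__":
    m, alpha, beta = 2.0, 3.0, 1.5
    t0 = time_threshold(alpha, beta)
    print(hypercontractive_sides(m, alpha, beta, t0))        # both sides at the threshold
    print(hypercontractive_sides(m, alpha, beta, t0 + 0.5))  # strictly inside the range
    t = 0.7
    print(renyi_shifted_gaussian(ou_output_mean(m, t), 1.0),
          math.exp(-2.0 * t) * renyi_shifted_gaussian(m, 1.0))
```

For this family both the estimate at the threshold time and the bound of Corollary 9 hold with equality, which suggests that neither constant can be improved in general.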
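3.3 Logarithmic Sobolev inequalities: the general scheme

Now that we have seen the basic idea behind log-Sobolev inequalities in the concrete case of i.i.d. Gaussian random variables, we are ready to take a more general viewpoint. To that end, we adopt the framework of Bobkov and Götze [53] and consider a probability space $(\Omega, \mathcal{F}, \mu)$ together with a pair $(\mathcal{A}, \Gamma)$ that satisfies the following requirements:

• (LSI-1) $\mathcal{A}$ is a family of bounded measurable functions on $\Omega$, such that if $f \in \mathcal{A}$, then $af + b \in \mathcal{A}$ as well for any $a \ge 0$ and $b \in \mathbb{R}$.

• (LSI-2) $\Gamma$ is an operator that maps functions in $\mathcal{A}$ to nonnegative measurable functions on $\Omega$.

• (LSI-3) For any $f \in \mathcal{A}$, $a \ge 0$, and $b \in \mathbb{R}$, $\Gamma(af + b) = a\, \Gamma f$.

Then we say that $\mu$ satisfies a logarithmic Sobolev inequality with constant $c \ge 0$, or LSI($c$) for short, if

$$D(\mu^{(f)}\|\mu) \le \frac{c}{2}\, \mathbb{E}_\mu^{(f)}\left[ (\Gamma f)^2 \right], \qquad \forall\, f \in \mathcal{A} \tag{3.95}$$

where, as before, $\mu^{(f)}$ denotes the $f$-tilting of $\mu$, i.e.,

$$\frac{d\mu^{(f)}}{d\mu} = \frac{\exp(f)}{\mathbb{E}_\mu[\exp(f)]},$$

and $\mathbb{E}_\mu^{(f)}[\cdot]$ denotes expectation w.r.t. $\mu^{(f)}$.

Remark 27. We have expressed the log-Sobolev inequality using standard information-theoretic notation. Most of the mathematics literature dealing with the subject, however, uses a different notation, which we briefly summarize for the reader's benefit. Given a probability measure $\mu$ on $\Omega$ and a nonnegative function $g : \Omega \to \mathbb{R}$, define the entropy functional

$$\mathrm{Ent}_\mu(g) \triangleq \int g \ln g\, d\mu - \int g\, d\mu \cdot \ln\left( \int g\, d\mu \right) \equiv \mathbb{E}_\mu[g \ln g] - \mathbb{E}_\mu[g] \ln \mathbb{E}_\mu[g] \tag{3.96}$$

with the convention that $0 \ln 0 = 0$. Then, the LSI($c$) condition can be equivalently written as (cf. [53, p. 2])

$$\mathrm{Ent}_\mu\big(\exp(f)\big) \le \frac{c}{2} \int (\Gamma f)^2 \exp(f)\, d\mu. \tag{3.97}$$

To see the equivalence of (3.95) and (3.97), note that

$$\mathrm{Ent}_\mu\big(\exp(f)\big) = \mathbb{E}_\mu[\exp(f)] \cdot D(\mu^{(f)}\|\mu) \tag{3.98}$$

and

$$\int (\Gamma f)^2 \exp(f)\, d\mu = \mathbb{E}_\mu[\exp(f)] \cdot \mathbb{E}_\mu^{(f)}\left[ (\Gamma f)^2 \right]. \tag{3.99}$$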
Substituting (3.98) and (3.99) into (3.97), we obtain (3.95). We note that the entropy functional Ent is homogeneous of degree 1: for any $g$ such that $\mathrm{Ent}_\mu(g) < \infty$ and any $a > 0$, we have

$$\mathrm{Ent}_\mu(ag) = a\, \mathbb{E}_\mu\left[ g \ln \frac{g}{\mathbb{E}_\mu[g]} \right] = a\, \mathrm{Ent}_\mu(g).$$
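
On a finite probability space the identity $\mathrm{Ent}_\mu(\exp(f)) = \mathbb{E}_\mu[\exp(f)] \cdot D(\mu^{(f)}\|\mu)$ and the homogeneity of Ent are easy to confirm numerically. The following sketch does so for an arbitrary three-point measure; the measure, the function $f$, and all names are illustrative assumptions, and $g$ is taken strictly positive so that no $0 \ln 0$ convention is needed.

```python
import numpy as np

def ent(mu, g):
    """Entropy functional Ent_mu(g) = E[g ln g] - E[g] ln E[g] for strictly positive g."""
    mu, g = np.asarray(mu, dtype=float), np.asarray(g, dtype=float)
    mean_g = np.sum(mu * g)
    return np.sum(mu * g * np.log(g)) - mean_g * np.log(mean_g)

def tilt(mu, f):
    """The f-tilting of mu: mu^{(f)}(x) = mu(x) exp(f(x)) / E_mu[exp(f)]."""
    w = np.asarray(mu, dtype=float) * np.exp(np.asarray(f, dtype=float))
    return w / w.sum()

def kl(p, q):
    """D(P || Q) in nats for strictly positive finite distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

if __name__ == "__main__":
    mu = np.array([0.2, 0.5, 0.3])
    f = np.array([0.3, -1.0, 2.0])
    lhs = ent(mu, np.exp(f))
    rhs = np.sum(mu * np.exp(f)) * kl(tilt(mu, f), mu)
    print(lhs, rhs)
    print(ent(mu, 3.7 * np.exp(f)), 3.7 * lhs)
```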



Remark 28. Strictly speaking, (3.95) should be called a modified (or exponential) logarithmic Sobolev inequality. The ordinary log-Sobolev inequality takes the form

$$\mathrm{Ent}_\mu(g^2) \le 2c \int (\Gamma g)^2\, d\mu \tag{3.100}$$

for all strictly positive $g \in \mathcal{A}$. If the pair $(\mathcal{A}, \Gamma)$ is such that $\psi \circ g \in \mathcal{A}$ for any $g \in \mathcal{A}$ and any $C^\infty$ function $\psi : \mathbb{R} \to \mathbb{R}$, and $\Gamma$ obeys the chain rule

$$\Gamma(\psi \circ g) = |\psi' \circ g|\, \Gamma g, \qquad \forall\, g \in \mathcal{A},\ \psi \in C^\infty \tag{3.101}$$

then (3.95) and (3.100) are equivalent. Indeed, if (3.100) holds, then using it with $g = \exp(f/2)$ gives

$$\mathrm{Ent}_\mu\big(\exp(f)\big) \le 2c \int \big( \Gamma(\exp(f/2)) \big)^2 d\mu = \frac{c}{2} \int (\Gamma f)^2 \exp(f)\, d\mu$$

which is (3.97). Note that the last equality follows from (3.101), which implies that

$$\Gamma(\exp(f/2)) = \frac{1}{2} \exp(f/2) \cdot \Gamma f.$$

Conversely, using (3.97) with $f = 2 \ln g$ gives (where, from (3.101), it follows that $\Gamma(2 \ln g) = \frac{2\, \Gamma g}{g}$ for all strictly positive $g \in \mathcal{A}$)

$$\mathrm{Ent}_\mu(g^2) \le \frac{c}{2} \int \big( \Gamma(2 \ln g) \big)^2 g^2\, d\mu = 2c \int (\Gamma g)^2\, d\mu,$$

which is (3.100). In fact, the Gaussian log-Sobolev inequality we have looked at in Section 3.2 is an instance in which this equivalence holds, with $\Gamma f = \|\nabla f\|$ clearly satisfying the chain rule (3.101).

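Recalling the discussion of Section 3.1.4, we now show how we can pass from a log-Sobolev inequality to a concentration inequality via the Herbst argument. Indeed, let $\Omega = \mathcal{X}^n$ and $\mu = P$, and suppose that $P$ satisfies LSI($c$) on an appropriate pair $(\mathcal{A}, \Gamma)$. Suppose, furthermore, that the function of interest $f$ is an element of $\mathcal{A}$ and that $\|\Gamma f\|_\infty < \infty$ (otherwise, LSI($c$) is vacuously true for any $c$). Then $tf \in \mathcal{A}$ for any $t \ge 0$, so applying (3.95) to $g = tf$ we get

$$D\big(P^{(tf)}\|P\big) \le \frac{c}{2}\, \mathbb{E}_P^{(tf)}\left[ \big( \Gamma(tf) \big)^2 \right] = \frac{c t^2}{2}\, \mathbb{E}_P^{(tf)}\left[ (\Gamma f)^2 \right] \le \frac{c\, \|\Gamma f\|_\infty^2\, t^2}{2}, \tag{3.102}$$

where the second step uses the fact that $\Gamma(tf) = t\, \Gamma f$ for any $f \in \mathcal{A}$ and any $t \ge 0$. In other words, $P$ satisfies the bound (3.33) for every $g \in \mathcal{A}$ with $E(g) = \|\Gamma g\|_\infty^2$. Therefore, using the bound (3.102) together with Corollary 7, we arrive at

$$\mathbb{P}\big( f(X^n) \ge \mathbb{E}f(X^n) + r \big) \le \exp\left( -\frac{r^2}{2c\, \|\Gamma f\|_\infty^2} \right), \qquad \forall\, r \ge 0. \tag{3.103}$$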
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3.3.1 Tensorization of the logarithmic Sobolev inequality 

In the above demonstration, we have capitalized on an appropriate log-Sobolev inequality in order to 
derive a concentration inequality. Showing that a log-Sobolev inequality actually holds can be very 
difficult for reasons discussed in Section 13,1.31 However, when the probability measure P is a product 
measure, i.e., the random variables X±, . . . ,X n G X are independent under P, we can, once again, use 
the "divide-and-conquer" tensorization strategy: we break the original n-dimensional problem into n 
one-dimensional subproblems, then establish that each marginal distribution Px t , i = 1, • • • ,n, satisfies 
a log-Sobolev inequality for a suitable class of real-valued functions on X, and finally appeal to the 
tensorization bound for the relative entropy. 

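Let us provide the abstract scheme first. Suppose that for each $i \in \{1, \ldots, n\}$ we have a pair
$(\mathcal{A}_i, \Gamma_i)$ defined on $\mathcal{X}$ that satisfies the requirements (LSI-1)-(LSI-3) listed at the beginning of
Section 3.3. Recall that for any function $f : \mathcal{X}^n \to \mathbb{R}$, for any $i \in \{1, \ldots, n\}$, and any $(n-1)$-tuple
$\bar{x}^i = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$, we have defined a function $f_i(\cdot | \bar{x}^i) : \mathcal{X} \to \mathbb{R}$ via
$f_i(x_i | \bar{x}^i) \triangleq f(x^n)$. Then, we have the following:

Theorem 24. Let $X_1, \ldots, X_n \in \mathcal{X}$ be $n$ independent random variables, and let $P = P_{X_1} \otimes \cdots \otimes P_{X_n}$ be
their joint distribution. Let $\mathcal{A}$ consist of all functions $f : \mathcal{X}^n \to \mathbb{R}$ such that, for every $i \in \{1, \ldots, n\}$,

$$f_i(\cdot | \bar{x}^i) \in \mathcal{A}_i, \qquad \forall \bar{x}^i \in \mathcal{X}^{n-1}. \tag{3.104}$$

Define the operator $\Gamma$ that maps each $f \in \mathcal{A}$ to

$$\Gamma f = \sqrt{\sum_{i=1}^n (\Gamma_i f_i)^2}, \tag{3.105}$$

which is shorthand for

$$\Gamma f(x^n) = \sqrt{\sum_{i=1}^n \big(\Gamma_i f_i(x_i | \bar{x}^i)\big)^2}, \qquad \forall x^n \in \mathcal{X}^n. \tag{3.106}$$

Then, the following statements hold:

1. If there exists a constant $c \ge 0$ such that, for every $i$, $P_{X_i}$ satisfies LSI($c$) with respect to $(\mathcal{A}_i, \Gamma_i)$, then
$P$ satisfies LSI($c$) with respect to $(\mathcal{A}, \Gamma)$.

2. For any $f \in \mathcal{A}$ with $\mathbb{E}[f(X^n)] = 0$, and any $r > 0$,

$$\mathbb{P}\big(f(X^n) \ge r\big) \le \exp\left(-\frac{r^2}{2c \|\Gamma f\|_\infty^2}\right). \tag{3.107}$$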

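Proof. We first check that the pair $(\mathcal{A}, \Gamma)$, defined in the statement of the theorem, satisfies the
requirements (LSI-1)-(LSI-3). Thus, consider some $f \in \mathcal{A}$, choose some $a \ge 0$ and $b \in \mathbb{R}$, and let $g = af + b$.
Then, for any $i$ and any $\bar{x}^i$,

$$\begin{aligned}
g_i(\cdot | \bar{x}^i) &= g(x_1, \ldots, x_{i-1}, \cdot, x_{i+1}, \ldots, x_n) \\
&= a f(x_1, \ldots, x_{i-1}, \cdot, x_{i+1}, \ldots, x_n) + b \\
&= a f_i(\cdot | \bar{x}^i) + b \in \mathcal{A}_i
\end{aligned}$$

where the last step uses (3.104). Hence, $f \in \mathcal{A}$ implies that $g = af + b \in \mathcal{A}$ for any $a \ge 0$, $b \in \mathbb{R}$, so
(LSI-1) holds. From the definitions of $\Gamma$ in (3.105) and (3.106) it is readily seen that (LSI-2) and (LSI-3)
hold as well.

Next, for any $f \in \mathcal{A}$ and any $t > 0$, we have

$$\begin{aligned}
D(P^{(tf)} \| P) &\le \sum_{i=1}^n D\big(P^{(tf)}_{X_i | \bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, P^{(tf)}_{\bar{X}^i}\big) \\
&= \sum_{i=1}^n \int P^{(tf)}_{\bar{X}^i}(d\bar{x}^i)\, D\big(P^{(tf)}_{X_i | \bar{X}^i = \bar{x}^i} \,\big\|\, P_{X_i}\big) \\
&= \sum_{i=1}^n \int P^{(tf)}_{\bar{X}^i}(d\bar{x}^i)\, D\big(P^{(t f_i(\cdot | \bar{x}^i))}_{X_i} \,\big\|\, P_{X_i}\big) \\
&\le \frac{c t^2}{2} \sum_{i=1}^n \int P^{(tf)}_{\bar{X}^i}(d\bar{x}^i)\, \mathbb{E}^{(t f_i(\cdot | \bar{x}^i))}_{X_i}\Big[\big(\Gamma_i f_i(X_i | \bar{x}^i)\big)^2\Big] \\
&= \frac{c t^2}{2} \sum_{i=1}^n \mathbb{E}_{P^{(tf)}_{\bar{X}^i}}\Big\{ \mathbb{E}_{P^{(tf)}_{X_i | \bar{X}^i}}\Big[\big(\Gamma_i f_i(X_i | \bar{X}^i)\big)^2 \,\Big|\, \bar{X}^i\Big] \Big\} \\
&= \frac{c t^2}{2} \cdot \mathbb{E}^{(tf)}_P\big[(\Gamma f)^2\big]
\end{aligned} \tag{3.108}$$

where the first step uses Proposition 5 with $Q = P^{(tf)}$, the second is by the definition of conditional
divergence where $P_{X_i} = P_{X_i | \bar{X}^i}$, the third is due to (3.25), the fourth uses the fact that (a) $f_i(\cdot | \bar{x}^i) \in \mathcal{A}_i$
for all $\bar{x}^i$ and (b) $P_{X_i}$ satisfies LSI($c$) w.r.t. $(\mathcal{A}_i, \Gamma_i)$, and the last two steps use the tower property of the
conditional expectation, as well as (3.105). We have thus proved the first part of the theorem, i.e., that
$P$ satisfies LSI($c$) w.r.t. the pair $(\mathcal{A}, \Gamma)$. The second part follows from the same argument that was used
to prove (3.103). □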



3.3.2 Maurer's thermodynamic method 

With Theorem 24 at our disposal, we can now establish concentration inequalities in product spaces
whenever an appropriate log-Sobolev inequality can be shown to hold for each individual variable. Thus,
the bulk of the effort is in showing that this is, indeed, the case for a given probability measure $P$ and a
given class of functions. Ordinarily, this is done on a case-by-case basis. However, as shown recently by
A. Maurer in an insightful paper [139], it is possible to derive log-Sobolev inequalities in a wide variety
of settings by means of a single unified method. This method has two basic ingredients:

1. A certain "thermodynamic" representation of the divergence $D(\mu^{(f)} \| \mu)$, $f \in \mathcal{A}$, as an integral of the
variances of $f$ w.r.t. the tilted measures $\mu^{(tf)}$ for all $t \in (0, 1)$.

2. Derivation of upper bounds on these variances in terms of an appropriately chosen operator $\Gamma$ acting on
$\mathcal{A}$, where $\mathcal{A}$ and $\Gamma$ are the objects satisfying the conditions (LSI-1)-(LSI-3).

In this section, we will state two lemmas that underlie these two ingredients and then describe the overall
method in broad strokes. Several detailed demonstrations of the method in action will be given in the
sections that follow.

Once again, consider a probability space $(\Omega, \mathcal{F}, \mu)$ and recall the definition of the $g$-tilting of $\mu$:

$$\frac{d\mu^{(g)}}{d\mu} = \frac{\exp(g)}{\mathbb{E}_\mu[\exp(g)]}.$$

The variance of any $h : \Omega \to \mathbb{R}$ w.r.t. $\mu^{(g)}$ is then given by

$$\mathrm{var}^{(g)}_\mu[h] \triangleq \mathbb{E}^{(g)}_\mu[h^2] - \big(\mathbb{E}^{(g)}_\mu[h]\big)^2.$$

The first ingredient of Maurer's method is encapsulated in the following (see [139, Theorem 3]):

Lemma 9 (Representation of the divergence in terms of thermal fluctuations). Consider a function
$f : \Omega \to \mathbb{R}$ such that $\mathbb{E}_\mu[\exp(\lambda f)] < \infty$ for all $\lambda > 0$. Then

$$D\big(\mu^{(\lambda f)} \,\big\|\, \mu\big) = \int_0^\lambda \int_t^\lambda \mathrm{var}^{(sf)}_\mu[f] \, ds \, dt. \tag{3.109}$$



Remark 29. The thermodynamic interpretation of the above result stems from the fact that the tilted
measures $\mu^{(tf)}$ can be viewed as the Gibbs measures that are used in statistical mechanics as a probabilistic
description of physical systems in thermal equilibrium. In this interpretation, the underlying space $\Omega$
is the state (or configuration) space of some physical system $\Sigma$, the elements $x \in \Omega$ are the states (or
configurations) of $\Sigma$, $\mu$ is some base (or reference) measure, and $f$ is the energy function. We can view
$\mu$ as some initial distribution of the system state. According to the postulates of statistical physics, the
thermal equilibrium of $\Sigma$ at absolute temperature $\theta$ corresponds to that distribution $\nu$ on $\Omega$ that will
globally minimize the free energy functional

$$\Psi_\theta(\nu) \triangleq \mathbb{E}_\nu[f] + \theta D(\nu \| \mu). \tag{3.110}$$

It is claimed that $\Psi_\theta(\nu)$ is uniquely minimized by $\nu^* = \mu^{(-tf)}$, where $t = 1/\theta$ is the inverse temperature.
To see this, consider an arbitrary $\nu$, where we may assume, without loss of generality, that $\nu \ll \mu$. Let
$\psi \triangleq d\nu / d\mu$. Then

$$\frac{d\nu}{d\mu^{(-tf)}} = \frac{d\nu / d\mu}{d\mu^{(-tf)} / d\mu} = \frac{\psi}{\exp(-tf) / \mathbb{E}_\mu[\exp(-tf)]} = \psi \exp(tf)\, \mathbb{E}_\mu[\exp(-tf)]$$

and

$$\begin{aligned}
\Psi_\theta(\nu) &= \frac{1}{t}\, \mathbb{E}_\nu\big[tf + \ln \psi\big] \\
&= \frac{1}{t}\, \mathbb{E}_\nu\big[\ln\big(\psi \exp(tf)\big)\big] \\
&= \frac{1}{t}\, \mathbb{E}_\nu\left[\ln \frac{d\nu}{d\mu^{(-tf)}} - \Lambda(-t)\right] \\
&= \frac{1}{t}\, \Big[D\big(\nu \,\|\, \mu^{(-tf)}\big) - \Lambda(-t)\Big]
\end{aligned}$$

where, as before, $\Lambda(-t) \triangleq \ln\big(\mathbb{E}_\mu[\exp(-tf)]\big)$ is the logarithmic moment generating function of $f$ w.r.t. $\mu$.
Therefore, $\Psi_\theta(\nu) = \Psi_{1/t}(\nu) \ge -\Lambda(-t)/t$, with equality if and only if $\nu = \mu^{(-tf)}$.
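
The variational claim of Remark 29 is easy to probe numerically on a finite state space. The following is a minimal sketch under stated assumptions: the five-point space, the random base measure, the energy values and the helper names (`kl`, `free_energy`, `gibbs`, `random_pmf`) are all illustrative choices, not part of the original text. It compares the free energy at the Gibbs measure $\mu^{(-tf)}$ with the value $-\Lambda(-t)/t$ and with the free energy of randomly drawn competitors.

```python
import math
import random


def kl(nu, mu):
    """Relative entropy D(nu || mu) in nats for pmfs on a finite set."""
    return sum(a * math.log(a / b) for a, b in zip(nu, mu) if a > 0.0)


def free_energy(nu, mu, f, theta):
    """Psi_theta(nu) = E_nu[f] + theta * D(nu || mu), as in (3.110)."""
    mean_energy = sum(a * fx for a, fx in zip(nu, f))
    return mean_energy + theta * kl(nu, mu)


def gibbs(mu, f, theta):
    """Return the tilted measure mu^(-tf), t = 1/theta, and Z = E_mu[exp(-t f)]."""
    weights = [m * math.exp(-fx / theta) for m, fx in zip(mu, f)]
    z = sum(weights)
    return [w / z for w in weights], z


def random_pmf(size, rng):
    """A strictly positive random pmf (illustrative helper)."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
mu = random_pmf(5, rng)                          # base measure (assumed example)
f = [rng.uniform(-1.0, 1.0) for _ in range(5)]   # energy function (assumed example)
theta = 0.7

nu_star, z = gibbs(mu, f, theta)
psi_min = free_energy(nu_star, mu, f, theta)
lower_bound = -theta * math.log(z)               # equals -Lambda(-t)/t with t = 1/theta
gaps = [free_energy(random_pmf(5, rng), mu, f, theta) - psi_min for _ in range(1000)]
print(psi_min, lower_bound, min(gaps))
```

In this sketch the free energy at the Gibbs measure should agree with the lower bound up to rounding, and every randomly drawn competitor should have a non-negative gap, in line with the identity $\Psi_\theta(\nu) = [D(\nu \| \mu^{(-tf)}) - \Lambda(-t)]/t$.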

In the context of the relation in (3.110) between the free energy functional and information divergence,
the reader is referred to a recent monograph by Merhav [140] that highlights some interesting relations 
between information theory and statistical physics. This monograph relates thermodynamic potentials 
(like the thermodynamical entropy and free energy) to information measures (like the Shannon entropy 
and information divergence); it also provides some rigorous mathematical tools that were triggered by 
the physical point of view, and which prove to be useful in dealing with information-theoretic problems. 

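Now we give the proof of Lemma 9.

Proof. We start by noting that (see (3.11))

$$\Lambda'(t) = \mathbb{E}^{(tf)}_\mu[f] \qquad \text{and} \qquad \Lambda''(t) = \mathrm{var}^{(tf)}_\mu[f], \tag{3.111}$$

and, in particular, $\Lambda'(0) = \mathbb{E}_\mu[f]$. Moreover, from (3.13), we get

$$D\big(\mu^{(\lambda f)} \,\big\|\, \mu\big) = \lambda \Lambda'(\lambda) - \Lambda(\lambda). \tag{3.112}$$

Now, using (3.111), we get

$$\begin{aligned}
\lambda \Lambda'(\lambda) &= \int_0^\lambda \Lambda'(\lambda) \, dt \\
&= \int_0^\lambda \left( \int_0^\lambda \Lambda''(s) \, ds + \Lambda'(0) \right) dt \\
&= \int_0^\lambda \left( \int_0^\lambda \mathrm{var}^{(sf)}_\mu[f] \, ds + \mathbb{E}_\mu[f] \right) dt
\end{aligned} \tag{3.113}$$

and, since $\Lambda(0) = 0$,

$$\begin{aligned}
\Lambda(\lambda) &= \int_0^\lambda \Lambda'(t) \, dt \\
&= \int_0^\lambda \left( \int_0^t \Lambda''(s) \, ds + \Lambda'(0) \right) dt \\
&= \int_0^\lambda \left( \int_0^t \mathrm{var}^{(sf)}_\mu[f] \, ds + \mathbb{E}_\mu[f] \right) dt.
\end{aligned} \tag{3.114}$$

Substituting (3.113) and (3.114) into (3.112), we get (3.109). □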
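
As a sanity check of the thermal fluctuation representation (3.109), the sketch below evaluates both sides on a four-point space. The measure `mu`, the function `f`, the value of `lam` and all helper names are assumptions made for illustration. The double integral over $\{0 \le t \le s \le \lambda\}$ is reduced by Fubini's theorem to the one-dimensional integral $\int_0^\lambda s \, \mathrm{var}^{(sf)}_\mu[f] \, ds$, which is then approximated with the composite Simpson rule.

```python
import math


def tilt(mu, f, s):
    """The tilted pmf mu^(sf) on a finite set."""
    weights = [m * math.exp(s * fx) for m, fx in zip(mu, f)]
    z = sum(weights)
    return [w / z for w in weights]


def tilted_variance(mu, f, s):
    """Variance of f under the tilted pmf mu^(sf)."""
    p = tilt(mu, f, s)
    mean = sum(pi * fx for pi, fx in zip(p, f))
    return sum(pi * (fx - mean) ** 2 for pi, fx in zip(p, f))


def kl(p, q):
    """Relative entropy D(p || q) in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def thermal_integral(mu, f, lam, steps=2000):
    """Simpson approximation of int_0^lam s * var^(sf)[f] ds (steps must be even)."""
    h = lam / steps
    total = 0.0
    for k in range(steps + 1):
        s = k * h
        weight = 1 if k in (0, steps) else (4 if k % 2 else 2)
        total += weight * s * tilted_variance(mu, f, s)
    return total * h / 3.0


mu = [0.1, 0.2, 0.3, 0.4]   # assumed example measure
f = [1.0, -0.5, 2.0, 0.3]   # assumed example function
lam = 1.5

lhs = kl(tilt(mu, f, lam), mu)     # D(mu^(lam f) || mu), computed directly
rhs = thermal_integral(mu, f, lam)  # right-hand side of (3.109)
print(lhs, rhs)
```

The two printed numbers should agree up to the quadrature error, which illustrates that (3.109) is an exact identity rather than a bound.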



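Now the whole affair hinges on the second step, which involves bounding the variances $\mathrm{var}^{(tf)}_\mu[f]$, for
$t > 0$, from above in terms of expectations $\mathbb{E}^{(tf)}_\mu[(\Gamma f)^2]$ for an appropriately chosen $\Gamma$. The following is
sufficiently general for our needs:

Theorem 25. Let the objects $(\mathcal{A}, \Gamma)$ and $\{(\mathcal{A}_i, \Gamma_i)\}_{i=1}^n$ be constructed as in the statement of Theorem 24.
Furthermore, suppose that for each $i \in \{1, \ldots, n\}$, the operator $\Gamma_i$ maps each $g \in \mathcal{A}_i$ to a constant (which
may depend on $g$), and there exists a constant $c \ge 0$ such that the bound

$$\mathrm{var}^{(sg)}_i\big[g(X_i) \,\big|\, \bar{X}^i = \bar{x}^i\big] \le c\, (\Gamma_i g)^2, \qquad \forall \bar{x}^i \in \mathcal{X}^{n-1} \tag{3.115}$$

holds for all $i \in \{1, \ldots, n\}$, $s > 0$, and $g \in \mathcal{A}_i$, where $\mathrm{var}^{(g)}_i[\,\cdot\, | \bar{X}^i = \bar{x}^i]$ denotes the (conditional) variance
w.r.t. $P^{(g)}_{X_i | \bar{X}^i = \bar{x}^i}$. Then, the pair $(\mathcal{A}, \Gamma)$ satisfies LSI($c$) w.r.t. $P_{X^n}$.

Proof. Given a function $f \in \mathcal{A}$ then, by construction, $f_i : \mathcal{X}_i \to \mathbb{R}$ is in $\mathcal{A}_i$ for each $i \in \{1, \ldots, n\}$. We
can write

$$\begin{aligned}
D\big(P^{(f)}_{X_i | \bar{X}^i = \bar{x}^i} \,\big\|\, P_{X_i}\big) &= D\big(P^{(f_i(\cdot | \bar{x}^i))}_{X_i} \,\big\|\, P_{X_i}\big) \\
&= \int_0^1 \int_t^1 \mathrm{var}^{(s f_i(\cdot | \bar{x}^i))}_i\big[f_i(X_i | \bar{x}^i) \,\big|\, \bar{X}^i = \bar{x}^i\big] \, ds \, dt \\
&\le c\, (\Gamma_i f_i)^2 \int_0^1 \int_t^1 ds \, dt \\
&= \frac{c\, (\Gamma_i f_i)^2}{2}
\end{aligned}$$

where the first step uses the fact that $P^{(f)}_{X_i | \bar{X}^i = \bar{x}^i}$ is equal to the $f_i(\cdot | \bar{x}^i)$-tilting of $P_{X_i}$, the second step uses
Lemma 9, and the third step uses (3.115) with $g = f_i(\cdot | \bar{x}^i)$. We have therefore established that, for each
$i$, the pair $(\mathcal{A}_i, \Gamma_i)$ satisfies LSI($c$). Therefore, the pair $(\mathcal{A}, \Gamma)$ satisfies LSI($c$) by Theorem 24. □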

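The following two lemmas will be useful for establishing bounds like (3.115):

Lemma 10. Let $U$ be a random variable such that $U \in [a, b]$ a.s. for some $-\infty < a \le b < +\infty$. Then

$$\mathrm{var}[U] \le \frac{(b - a)^2}{4}. \tag{3.116}$$

Proof. Since the support of $U$ is contained in the interval $[a, b]$, we have $|U - (a+b)/2| \le (b-a)/2$ a.s., so
$\mathrm{var}[U] \le \mathbb{E}[(U - (a+b)/2)^2] \le (b-a)^2/4$. The maximal variance of $U$ is attained when the
random variable $U$ is binary and takes each of the endpoints of this interval with probability $\frac{1}{2}$; the bound
in (3.116) is achieved with equality in this case. □

Lemma 11. [139, Lemma 9] Let $f$ be a real-valued function such that $f - \mathbb{E}_\mu[f] \le C$ for some $C \in \mathbb{R}$.
Then, for every $t > 0$,

$$\mathrm{var}^{(tf)}_\mu[f] \le \exp(tC)\, \mathrm{var}_\mu[f].$$

Proof.

$$\begin{aligned}
\mathrm{var}^{(tf)}_\mu[f] &= \mathrm{var}^{(tf)}_\mu\big\{f - \mathbb{E}_\mu[f]\big\} && (3.117) \\
&\le \mathbb{E}^{(tf)}_\mu\big[(f - \mathbb{E}_\mu[f])^2\big] && (3.118) \\
&= \mathbb{E}_\mu\left[\frac{\exp(tf)\, (f - \mathbb{E}_\mu[f])^2}{\mathbb{E}_\mu[\exp(tf)]}\right] && (3.119) \\
&\le \mathbb{E}_\mu\big\{(f - \mathbb{E}_\mu[f])^2 \exp\big[t(f - \mathbb{E}_\mu[f])\big]\big\} && (3.120) \\
&\le \exp(tC)\, \mathbb{E}_\mu\big[(f - \mathbb{E}_\mu[f])^2\big] && (3.121)
\end{aligned}$$

where:

(3.117) holds since $\mathrm{var}[f] = \mathrm{var}[f + c]$ for any constant $c \in \mathbb{R}$;

(3.118) uses the bound $\mathrm{var}[U] \le \mathbb{E}[U^2]$;

(3.119) is by definition of the tilted distribution $\mu^{(tf)}$;

(3.120) follows from applying Jensen's inequality to the denominator; and

(3.121) uses the assumption that $f - \mathbb{E}_\mu[f] \le C$ and the monotonicity of $\exp(\cdot)$ (note that $t > 0$).

Since the right-hand side of (3.121) equals $\exp(tC)\, \mathrm{var}_\mu[f]$, this completes the proof of Lemma 11. □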
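
Both variance bounds can be probed on random finite distributions. The sketch below is illustrative only: the sampling ranges, the seed and the helper names are assumptions, and $C$ is taken to be the smallest admissible constant $\max f - \mathbb{E}_\mu[f]$. It records the smallest slack observed in Lemma 10 (applied to $f$ under the tilted measure) and in Lemma 11.

```python
import math
import random


def tilted_variance(mu, f, t):
    """Variance of f under the tilted pmf mu^(tf); t = 0 gives the plain variance."""
    weights = [m * math.exp(t * fx) for m, fx in zip(mu, f)]
    z = sum(weights)
    mean = sum(w * fx for w, fx in zip(weights, f)) / z
    return sum(w * (fx - mean) ** 2 for w, fx in zip(weights, f)) / z


def lemma_slacks(mu, f, t):
    """Return (slack in Lemma 10, slack in Lemma 11) for the tilted variance."""
    mean = sum(m * fx for m, fx in zip(mu, f))
    c_const = max(f) - mean                      # smallest C with f - E[f] <= C
    var_t = tilted_variance(mu, f, t)
    slack10 = (max(f) - min(f)) ** 2 / 4.0 - var_t
    slack11 = math.exp(t * c_const) * tilted_variance(mu, f, 0.0) - var_t
    return slack10, slack11


rng = random.Random(1)
worst10, worst11 = float("inf"), float("inf")
for _ in range(2000):
    size = rng.randint(2, 6)
    raw = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(raw)
    mu = [w / total for w in raw]
    f = [rng.uniform(-2.0, 2.0) for _ in range(size)]
    s10, s11 = lemma_slacks(mu, f, rng.uniform(0.0, 3.0))
    worst10, worst11 = min(worst10, s10), min(worst11, s11)
print(worst10, worst11)
```

Lemma 10 is distribution-free, whereas the Lemma 11 bound depends on $\mu$ through $\mathrm{var}_\mu[f]$ and $C$; this distinction is what Remark 31 below relies on.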

3.3.3 Discrete logarithmic Sobolev inequalities on the Hamming cube 

We now use Maurer's method to derive log-Sobolev inequalities for functions of $n$ i.i.d. Bernoulli random
variables. Let $\mathcal{X}$ be the two-point set $\{0, 1\}$, and let $e_i \in \mathcal{X}^n$ denote the binary string that has 1 in the
$i$th position and zeros elsewhere. Finally, for any $f : \mathcal{X}^n \to \mathbb{R}$ define

$$\Gamma f(x^n) \triangleq \sqrt{\sum_{i=1}^n \big(f(x^n \oplus e_i) - f(x^n)\big)^2}, \qquad \forall x^n \in \mathcal{X}^n, \tag{3.122}$$

where the modulo-2 addition $\oplus$ is defined componentwise. In other words, $\Gamma f$ measures the sensitivity of
$f$ to local bit flips. We consider the symmetric, i.e., Bernoulli(1/2), case first:

Theorem 26 (Discrete log-Sobolev inequality for the symmetric Bernoulli measure). Let $\mathcal{A}$ be the set
of all the functions $f : \mathcal{X}^n \to \mathbb{R}$. Then, the pair $(\mathcal{A}, \Gamma)$ with $\Gamma$ defined in (3.122) satisfies the conditions
(LSI-1)-(LSI-3). Let $X_1, \ldots, X_n$ be $n$ i.i.d. Bernoulli(1/2) random variables, and let $P$ denote their
distribution. Then, $P$ satisfies LSI(1/4) w.r.t. $(\mathcal{A}, \Gamma)$. In other words, for any $f : \mathcal{X}^n \to \mathbb{R}$,

$$D(P^{(f)} \| P) \le \frac{1}{8}\, \mathbb{E}^{(f)}_P\big[(\Gamma f)^2\big]. \tag{3.123}$$

Proof. Let $\mathcal{A}_0$ be the set of all functions $g : \{0, 1\} \to \mathbb{R}$, and let $\Gamma_0$ be the operator that maps every
$g \in \mathcal{A}_0$ to

$$\Gamma g \triangleq |g(0) - g(1)| = |g(x) - g(x \oplus 1)|, \qquad \forall x \in \{0, 1\}. \tag{3.124}$$

For each $i \in \{1, \ldots, n\}$, let $(\mathcal{A}_i, \Gamma_i)$ be a copy of $(\mathcal{A}_0, \Gamma_0)$. Then, each $\Gamma_i$ maps every function $g \in \mathcal{A}_i$ to
the constant $|g(0) - g(1)|$. Moreover, for any $g \in \mathcal{A}_i$, the random variable $U_i \triangleq g(X_i)$ is bounded between
$g(0)$ and $g(1)$, where we can assume without loss of generality that $g(0) \le g(1)$. Hence, by Lemma 10,
we have

$$\mathrm{var}^{(sg)}_i\big[g(X_i) \,\big|\, \bar{X}^i = \bar{x}^i\big] \le \frac{(g(0) - g(1))^2}{4} = \frac{(\Gamma_i g)^2}{4}, \qquad \forall g \in \mathcal{A}_i, \ \bar{x}^i \in \mathcal{X}^{n-1}. \tag{3.125}$$

In other words, the condition (3.115) of Theorem 25 holds with $c = 1/4$. In addition, it is easy to see that
the operator $\Gamma$ constructed from $\Gamma_1, \ldots, \Gamma_n$ according to (3.105) is precisely the one in (3.122). Therefore,
by Theorem 25, the pair $(\mathcal{A}, \Gamma)$ satisfies LSI(1/4) w.r.t. $P$, which proves (3.123). This completes the proof
of Theorem 26. □
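
For small $n$, inequality (3.123) can be checked by brute-force enumeration of the Hamming cube. The following is a rough sketch, assuming $n = 3$ and functions with values drawn uniformly from $[-3, 3]$; the function name `lsi_sides` and the dictionary representation of $f$ are illustrative choices. It computes $D(P^{(f)} \| P)$ and $\frac{1}{8}\mathbb{E}^{(f)}_P[(\Gamma f)^2]$ exactly and records the smallest slack.

```python
import itertools
import math
import random


def lsi_sides(f, n):
    """Return (D(P^(f) || P), (1/8) E^(f)[(Gamma f)^2]) for the uniform measure on {0,1}^n.

    f is a dict mapping each binary n-tuple to a real value.
    """
    points = list(itertools.product((0, 1), repeat=n))
    z = sum(math.exp(f[x]) for x in points)
    tilted = {x: math.exp(f[x]) / z for x in points}
    base = 2.0 ** (-n)
    divergence = sum(p * math.log(p / base) for p in tilted.values())

    def gamma_sq(x):
        # (Gamma f)^2 from (3.122): squared sensitivity to single bit flips.
        total = 0.0
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            total += (f[flipped] - f[x]) ** 2
        return total

    rhs = sum(tilted[x] * gamma_sq(x) for x in points) / 8.0
    return divergence, rhs


rng = random.Random(2)
n = 3
worst_slack = float("inf")
for _ in range(500):
    f = {x: rng.uniform(-3.0, 3.0) for x in itertools.product((0, 1), repeat=n)}
    divergence, rhs = lsi_sides(f, n)
    worst_slack = min(worst_slack, rhs - divergence)
print(worst_slack)
```

A non-negative `worst_slack` is consistent with Theorem 26; the check is of course no substitute for the proof, since it only samples finitely many functions.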



Remark 30. The log-Sobolev inequality in (3.123) is an exponential form of the original log-Sobolev
inequality for the Bernoulli(1/2) measure derived by Gross [44], which reads:

$$\mathrm{Ent}_P[g^2] \le \frac{(g(0) - g(1))^2}{2}. \tag{3.126}$$

To see this, define $f$ by $e^f = g^2$, where we may assume without loss of generality that $0 < g(0) \le g(1)$.
To show that (3.126) implies (3.123), note that

$$\begin{aligned}
(g(0) - g(1))^2 &= \big(\exp(f(0)/2) - \exp(f(1)/2)\big)^2 \\
&\le \frac{1}{8} \big[\exp(f(0)) + \exp(f(1))\big] \big(f(0) - f(1)\big)^2 \\
&= \frac{1}{4}\, \mathbb{E}_P\big[\exp(f) (\Gamma f)^2\big]
\end{aligned} \tag{3.127}$$

with $\Gamma f = |f(0) - f(1)|$, where the inequality follows from the easily verified fact that
$(1 - x)^2 \le \frac{(1 + x^2)(\ln x)^2}{2}$ for all $x > 0$, which we apply to $x \triangleq g(1)/g(0)$. Therefore, the inequality in
(3.126) implies the following:

$$\begin{aligned}
D(P^{(f)} \| P) &= \frac{\mathrm{Ent}_P[\exp(f)]}{\mathbb{E}_P[\exp(f)]} && (3.128) \\
&= \frac{\mathrm{Ent}_P[g^2]}{\mathbb{E}_P[\exp(f)]} && (3.129) \\
&\le \frac{(g(0) - g(1))^2}{2\, \mathbb{E}_P[\exp(f)]} && (3.130) \\
&\le \frac{\mathbb{E}_P\big[\exp(f) (\Gamma f)^2\big]}{8\, \mathbb{E}_P[\exp(f)]} && (3.131) \\
&= \frac{1}{8}\, \mathbb{E}^{(f)}_P\big[(\Gamma f)^2\big] && (3.132)
\end{aligned}$$

where equality (3.128) follows from (3.98), equality (3.129) holds due to the equality $e^f = g^2$,
inequality (3.130) holds due to (3.126), inequality (3.131) follows from (3.127), and equality (3.132) follows by
definition of the expectation w.r.t. the tilted probability measure $P^{(f)}$. Therefore, it is concluded that
indeed (3.126) implies (3.123).

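Gross used (3.126) and the central limit theorem to establish his Gaussian log-Sobolev inequality (see
Theorem 21). We can follow the same steps and arrive at (3.44) from (3.123). To that end, let $g : \mathbb{R} \to \mathbb{R}$
be a sufficiently smooth function (to guarantee, at least, that both $g \exp(g)$ and the derivative of $g$ are
continuous and bounded), and define the function $f : \{0, 1\}^n \to \mathbb{R}$ by

$$f(x_1, \ldots, x_n) \triangleq g\left(\frac{x_1 + x_2 + \cdots + x_n - n/2}{\sqrt{n/4}}\right).$$

If $X_1, \ldots, X_n$ are i.i.d. Bernoulli(1/2) random variables, then, by the central limit theorem, the sequence
of probability measures $\{P_{Z_n}\}_{n=1}^\infty$ with

$$Z_n \triangleq \frac{X_1 + \cdots + X_n - n/2}{\sqrt{n/4}}$$

converges weakly to the standard Gaussian distribution $G$ as $n \to \infty$. Therefore, by the assumed
smoothness properties of $g$ we have (see (3.96) and (3.98))

$$\begin{aligned}
\mathbb{E}\big[\exp(f(X^n))\big] \cdot D\big(P^{(f)}_{X^n} \,\big\|\, P_{X^n}\big)
&= \mathbb{E}\big[f(X^n) \exp(f(X^n))\big] - \mathbb{E}\big[\exp(f(X^n))\big] \ln \mathbb{E}\big[\exp(f(X^n))\big] \\
&= \mathbb{E}\big[g(Z_n) \exp(g(Z_n))\big] - \mathbb{E}\big[\exp(g(Z_n))\big] \ln \mathbb{E}\big[\exp(g(Z_n))\big] \\
&\xrightarrow{n \to \infty} \mathbb{E}\big[g(Z) \exp(g(Z))\big] - \mathbb{E}\big[\exp(g(Z))\big] \ln \mathbb{E}\big[\exp(g(Z))\big] \\
&= \mathbb{E}\big[\exp(g(Z))\big]\, D\big(P^{(g)}_Z \,\big\|\, P_Z\big)
\end{aligned} \tag{3.133}$$

where $Z \sim G$ is a standard Gaussian random variable. Moreover, using the definition (3.122) of $\Gamma$ and
the smoothness of $g$, for any $i \in \{1, \ldots, n\}$ and $x^n \in \{0, 1\}^n$ we have

$$\begin{aligned}
|f(x^n \oplus e_i) - f(x^n)|^2
&= \left| g\left(\frac{x_1 + \cdots + x_n - n/2}{\sqrt{n/4}} + \frac{(-1)^{x_i}}{\sqrt{n/4}}\right) - g\left(\frac{x_1 + \cdots + x_n - n/2}{\sqrt{n/4}}\right) \right|^2 \\
&= \frac{4}{n} \left( g'\left(\frac{x_1 + \cdots + x_n - n/2}{\sqrt{n/4}}\right) \right)^2 + o\left(\frac{1}{n}\right)
\end{aligned}$$

which implies that

$$|\Gamma f(x^n)|^2 = \sum_{i=1}^n \big(f(x^n \oplus e_i) - f(x^n)\big)^2 = 4 \left( g'\left(\frac{x_1 + \cdots + x_n - n/2}{\sqrt{n/4}}\right) \right)^2 + o(1).$$

Consequently,

$$\begin{aligned}
\mathbb{E}\big[\exp(f(X^n))\big] \cdot \mathbb{E}^{(f)}\big[(\Gamma f(X^n))^2\big]
&= \mathbb{E}\big[\exp(f(X^n)) (\Gamma f(X^n))^2\big] \\
&= 4\, \mathbb{E}\big[\exp(g(Z_n)) \big((g'(Z_n))^2 + o(1)\big)\big] \\
&\xrightarrow{n \to \infty} 4\, \mathbb{E}\big[\exp(g(Z)) (g'(Z))^2\big] \\
&= 4\, \mathbb{E}\big[\exp(g(Z))\big] \cdot \mathbb{E}^{(g)}_{P_Z}\big[(g'(Z))^2\big].
\end{aligned} \tag{3.134}$$

Taking the limit of both sides of (3.123) as $n \to \infty$ and then using (3.133) and (3.134), we obtain

$$D\big(P^{(g)}_Z \,\big\|\, P_Z\big) \le \frac{1}{2}\, \mathbb{E}^{(g)}_{P_Z}\big[(g'(Z))^2\big],$$

which is (3.44).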

Now let us consider the case when $X_1, \ldots, X_n$ are i.i.d. Bernoulli($p$) random variables with some
$p \ne 1/2$. We will use Maurer's method to give an alternative, simpler proof of the following result of
Ledoux [5H Corollary 5.9]:

Theorem 27. Consider any function $f : \{0, 1\}^n \to \mathbb{R}$ with the property that there exists some $c > 0$
such that

$$\max_{i \in \{1, \ldots, n\}} |f(x^n \oplus e_i) - f(x^n)| \le c \tag{3.135}$$

for all $x^n \in \{0, 1\}^n$. Let $X_1, \ldots, X_n$ be $n$ i.i.d. Bernoulli($p$) random variables, and let $P$ be their joint
distribution. Then

$$D(P^{(f)} \| P) \le pq \left( \frac{(c - 1) \exp(c) + 1}{c^2} \right) \mathbb{E}^{(f)}_P\big[(\Gamma f)^2\big], \tag{3.136}$$

where $q \triangleq 1 - p$.

Proof. Following the usual route, we will establish the $n = 1$ case first, and then scale up to an arbitrary
$n$ by tensorization.

Let $a \triangleq |\Gamma(f)| = |f(0) - f(1)|$, where $\Gamma$ is defined as in (3.124). Without loss of generality, we may
assume that $f(0) = 0$ and $f(1) = a$. Then

$$\mathbb{E}[f] = pa \qquad \text{and} \qquad \mathrm{var}[f] = pqa^2. \tag{3.137}$$

Using (3.137) and Lemma 11, since $f - \mathbb{E}[f] \le a - pa = qa$, it follows that for every $t > 0$

$$\mathrm{var}^{(tf)}_P[f] \le pqa^2 \exp(tqa).$$

Therefore, by Lemma 9 we have

$$\begin{aligned}
D(P^{(f)} \| P) &\le pqa^2 \int_0^1 \int_t^1 \exp(sqa) \, ds \, dt \\
&= pqa^2 \left( \frac{(qa - 1) \exp(qa) + 1}{(qa)^2} \right) \\
&\le pqa^2 \left( \frac{(c - 1) \exp(c) + 1}{c^2} \right)
\end{aligned}$$

where the last step follows from the fact that the function $u \mapsto u^{-2}[(u - 1) \exp(u) + 1]$ is nondecreasing
in $u > 0$, and $0 \le qa \le a \le c$. Since $a^2 = (\Gamma f)^2$, we can write

$$D(P^{(f)} \| P) \le pq \left( \frac{(c - 1) \exp(c) + 1}{c^2} \right) \mathbb{E}^{(f)}_P\big[(\Gamma f)^2\big],$$

so we have established (3.136) for $n = 1$.

Now consider an arbitrary $n \in \mathbb{N}$. Since the condition in (3.135) can be expressed as

$$|f_i(0 | \bar{x}^i) - f_i(1 | \bar{x}^i)| \le c, \qquad \forall i \in \{1, \ldots, n\}, \ \bar{x}^i \in \{0, 1\}^{n-1},$$

we can use (3.136) to write

$$D\big(P^{(f_i(\cdot | \bar{x}^i))}_{X_i} \,\big\|\, P_{X_i}\big) \le pq \left( \frac{(c - 1) \exp(c) + 1}{c^2} \right) \mathbb{E}^{(f_i(\cdot | \bar{x}^i))}_{P_{X_i}}\Big[\big(\Gamma_i f_i(X_i | \bar{X}^i)\big)^2 \,\Big|\, \bar{X}^i = \bar{x}^i\Big]$$

for every $i = 1, \ldots, n$ and all $\bar{x}^i \in \{0, 1\}^{n-1}$. With this, the same sequence of steps that led to (3.108) in
the proof of Theorem 24 can be used to complete the proof of (3.136) for arbitrary $n$. □
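
The bound (3.136) can be examined in the same brute-force manner as in the symmetric case. The sketch below is an illustration under assumptions: $n = 3$, random $p \in (0.05, 0.95)$, function values drawn from $[-2, 2]$, and $c$ taken as the smallest constant for which (3.135) holds; `theorem27_sides` is a hypothetical helper name.

```python
import itertools
import math
import random


def theorem27_sides(f, n, p):
    """Return (D(P^(f) || P), right-hand side of (3.136)) for i.i.d. Bernoulli(p) bits.

    f is a dict on {0,1}^n; c is the largest single-flip change of f, cf. (3.135).
    """
    q = 1.0 - p
    points = list(itertools.product((0, 1), repeat=n))

    def base(x):
        out = 1.0
        for bit in x:
            out *= p if bit else q
        return out

    def flips(x):
        return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(n)]

    c = max(abs(f[y] - f[x]) for x in points for y in flips(x))
    z = sum(base(x) * math.exp(f[x]) for x in points)
    tilted = {x: base(x) * math.exp(f[x]) / z for x in points}
    divergence = sum(t * math.log(t / base(x)) for x, t in tilted.items())
    gamma_sq_mean = sum(t * sum((f[y] - f[x]) ** 2 for y in flips(x))
                        for x, t in tilted.items())
    const = p * q * ((c - 1.0) * math.exp(c) + 1.0) / c ** 2
    return divergence, const * gamma_sq_mean


rng = random.Random(3)
n = 3
worst_slack = float("inf")
for _ in range(500):
    p = rng.uniform(0.05, 0.95)
    f = {x: rng.uniform(-2.0, 2.0) for x in itertools.product((0, 1), repeat=n)}
    divergence, rhs = theorem27_sides(f, n, p)
    worst_slack = min(worst_slack, rhs - divergence)
print(worst_slack)
```

Note that the constant in (3.136) tends to $pq/2$ as $c \to 0$, which matches the second-order expansion $D(P^{(f)} \| P) \approx \mathrm{var}_P[f]/2$ for small $f$ when $n = 1$.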

Remark 31. In order to capture the correct dependence on the Bernoulli parameter $p$, we had to use
the more refined, distribution-dependent variance bound of Lemma 11, as opposed to the cruder bound of
Lemma 10 that does not depend on the underlying distribution. Maurer's paper [139] has other examples.

Remark 32. The same technique, based on the central limit theorem, that was used to arrive at the
Gaussian log-Sobolev inequality (3.44) can be utilized here as well: given a sufficiently smooth function
$g : \mathbb{R} \to \mathbb{R}$, define $f : \{0, 1\}^n \to \mathbb{R}$ by

$$f(x^n) \triangleq g\left(\frac{x_1 + \cdots + x_n - np}{\sqrt{npq}}\right)$$

and then apply (3.136) to it.



3.3.4 The method of bounded differences revisited 

As our second illustration of the use of Maurer's method, we will give an information-theoretic proof
of McDiarmid's inequality with the correct constant in the exponent (recall that the original proof in
[6], [39] used the martingale method; the reader is referred to the derivation of McDiarmid's inequality
via the martingale approach in Theorem 2 of the preceding chapter). Following the exposition in [139,
Section 4.1], we have the following re-statement of McDiarmid's inequality in Theorem 2:

Theorem 28. Let $X_1, \ldots, X_n \in \mathcal{X}$ be independent random variables. Consider a function $f : \mathcal{X}^n \to \mathbb{R}$
with $\mathbb{E}[f(X^n)] = 0$, and also suppose that there exist some constants $0 \le c_1, \ldots, c_n < +\infty$ such that, for
each $i \in \{1, \ldots, n\}$,

$$|f_i(x | \bar{x}^i) - f_i(y | \bar{x}^i)| \le c_i, \qquad \forall x, y \in \mathcal{X}, \ \bar{x}^i \in \mathcal{X}^{n-1}. \tag{3.138}$$

Then, for any $r > 0$,

$$\mathbb{P}\big(f(X^n) \ge r\big) \le \exp\left(-\frac{2r^2}{\sum_{i=1}^n c_i^2}\right). \tag{3.139}$$

Proof. Let $\mathcal{A}_0$ be the set of all bounded measurable functions $g : \mathcal{X} \to \mathbb{R}$, and let $\Gamma_0$ be the operator
that maps every $g \in \mathcal{A}_0$ to $\Gamma_0 g \triangleq \sup_{x \in \mathcal{X}} g(x) - \inf_{x \in \mathcal{X}} g(x)$. Clearly, $\Gamma_0(ag + b) = a \Gamma_0 g$ for any $a \ge 0$
and $b \in \mathbb{R}$. Now, for each $i \in \{1, \ldots, n\}$, let $(\mathcal{A}_i, \Gamma_i)$ be a copy of $(\mathcal{A}_0, \Gamma_0)$. Then, each $\Gamma_i$ maps every
function $g \in \mathcal{A}_i$ to a non-negative constant. Moreover, for any $g \in \mathcal{A}_i$, the random variable $U_i = g(X_i)$
is bounded between $\inf_{x \in \mathcal{X}} g(x)$ and $\sup_{x \in \mathcal{X}} g(x) \equiv \inf_{x \in \mathcal{X}} g(x) + \Gamma_i g$. Therefore, Lemma 10 gives

$$\mathrm{var}^{(sg)}_i\big[g(X_i) \,\big|\, \bar{X}^i = \bar{x}^i\big] \le \frac{(\Gamma_i g)^2}{4}, \qquad \forall g \in \mathcal{A}_i, \ \bar{x}^i \in \mathcal{X}^{n-1}.$$

Hence, the condition (3.115) of Theorem 25 holds with $c = 1/4$. Now let $\mathcal{A}$ be the set of all bounded
measurable functions $f : \mathcal{X}^n \to \mathbb{R}$. Then, for any $f \in \mathcal{A}$, $i \in \{1, \ldots, n\}$ and $x^n \in \mathcal{X}^n$, we have

$$\begin{aligned}
\sup_{x_i \in \mathcal{X}} f(x_1, \ldots, x_i, \ldots, x_n) - \inf_{x_i \in \mathcal{X}} f(x_1, \ldots, x_i, \ldots, x_n)
&= \sup_{x_i \in \mathcal{X}} f_i(x_i | \bar{x}^i) - \inf_{x_i \in \mathcal{X}} f_i(x_i | \bar{x}^i) \\
&= \Gamma_i f_i(\cdot | \bar{x}^i).
\end{aligned}$$

Thus, if we construct an operator $\Gamma$ on $\mathcal{A}$ from $\Gamma_1, \ldots, \Gamma_n$ according to (3.105), the pair $(\mathcal{A}, \Gamma)$ will satisfy
the conditions of Theorem 24. Therefore, by Theorem 25 it follows that the pair $(\mathcal{A}, \Gamma)$ satisfies LSI(1/4)
for any product probability measure on $\mathcal{X}^n$. Hence, inequality (3.103) implies that

$$\mathbb{P}\big(f(X^n) \ge r\big) \le \exp\left(-\frac{2r^2}{\|\Gamma f\|_\infty^2}\right) \tag{3.140}$$

holds for any $r \ge 0$ and bounded $f$ with $\mathbb{E}[f] = 0$. Now, if $f$ satisfies (3.138), then

$$\begin{aligned}
\|\Gamma f\|_\infty^2 &= \sup_{x^n \in \mathcal{X}^n} \sum_{i=1}^n \big(\Gamma_i f_i(x_i | \bar{x}^i)\big)^2 \\
&\le \sum_{i=1}^n \sup_{x^n \in \mathcal{X}^n} \big(\Gamma_i f_i(x_i | \bar{x}^i)\big)^2 \\
&= \sum_{i=1}^n \sup_{x^n \in \mathcal{X}^n, \, y \in \mathcal{X}} |f_i(x_i | \bar{x}^i) - f_i(y | \bar{x}^i)|^2 \\
&\le \sum_{i=1}^n c_i^2.
\end{aligned}$$

Substituting this bound into the right-hand side of (3.140) gives (3.139). □
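
To see (3.139) at work in the simplest setting, consider $f(x^n) = \sum_i x_i - np$ for i.i.d. Bernoulli($p$) bits, for which (3.138) holds with $c_i = 1$. The sketch below is an illustrative assumption-laden example (the values $n = 30$, $p = 0.3$ and the grid of $r$ are arbitrary, and the helper names are hypothetical); it compares the exact binomial tail with the McDiarmid bound $\exp(-2r^2/n)$.

```python
import math


def binomial_upper_tail(n, p, k_min):
    """Exact P(S >= k_min) for S ~ Binomial(n, p)."""
    start = max(k_min, 0)
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(start, n + 1))


def mcdiarmid_bound(r, cs):
    """Right-hand side of (3.139) for bounded-difference constants cs."""
    return math.exp(-2.0 * r * r / sum(c * c for c in cs))


n, p = 30, 0.3
cs = [1.0] * n      # each bit changes the centered sum by at most 1
rows = []
for j in range(0, 21):
    r = 0.5 * j
    # P(S - n p >= r) = P(S >= ceil(n p + r)); the small offset guards against
    # floating-point noise when n p + r is an integer.
    k_min = math.ceil(n * p + r - 1e-9)
    rows.append((r, binomial_upper_tail(n, p, k_min), mcdiarmid_bound(r, cs)))

for r, tail, bound in rows:
    print(f"r = {r:4.1f}   exact tail = {tail:.3e}   bound = {bound:.3e}")
```

In this special case (3.139) reduces to Hoeffding's inequality for a binomial sum, so the exact tail should never exceed the bound on the printed grid.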

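It is instructive to compare the strategy used to prove Theorem 28 with an earlier approach by
Boucheron, Lugosi and Massart [141] using the entropy method. Their starting point is the following
lemma:

Lemma 12. Define the function $\phi : \mathbb{R} \to \mathbb{R}$ by $\phi(u) \triangleq \exp(u) - u - 1$. Consider a probability space
$(\Omega, \mathcal{F}, \mu)$ and a measurable function $f : \Omega \to \mathbb{R}$ such that $tf$ is exponentially integrable w.r.t. $\mu$ for all
$t \in \mathbb{R}$. Then, the following inequality holds for any $c \in \mathbb{R}$:

$$D\big(\mu^{(tf)} \,\big\|\, \mu\big) \le \mathbb{E}^{(tf)}_\mu\big[\phi\big(-t(f - c)\big)\big]. \tag{3.141}$$

Proof. Recall that

$$D\big(\mu^{(tf)} \,\big\|\, \mu\big) = t\, \mathbb{E}^{(tf)}_\mu[f] + \ln \frac{1}{\mathbb{E}_\mu[\exp(tf)]} = t\, \mathbb{E}^{(tf)}_\mu[f] - tc + \ln \frac{\exp(tc)}{\mathbb{E}_\mu[\exp(tf)]}.$$

Using this together with the inequality $\ln u \le u - 1$ for every $u > 0$, we can write

$$\begin{aligned}
D\big(\mu^{(tf)} \,\big\|\, \mu\big) &\le t\, \mathbb{E}^{(tf)}_\mu[f] - tc + \frac{\exp(tc)}{\mathbb{E}_\mu[\exp(tf)]} - 1 \\
&= t\, \mathbb{E}^{(tf)}_\mu[f] - tc + \mathbb{E}_\mu\left[\frac{\exp(t(f + c)) \exp(-tf)}{\mathbb{E}_\mu[\exp(tf)]}\right] - 1 \\
&= t\, \mathbb{E}^{(tf)}_\mu[f] + \exp(tc)\, \mathbb{E}^{(tf)}_\mu[\exp(-tf)] - tc - 1,
\end{aligned}$$

and we get (3.141). This completes the proof of Lemma 12. □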
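
The inequality (3.141) holds for every real $c$ and $t$, which is easy to probe numerically. The following sketch draws random finite distributions and random values of $t$ and $c$ (all ranges, the seed and the names `phi` and `lemma12_sides` are illustrative assumptions) and records the smallest slack.

```python
import math
import random


def phi(u):
    """phi(u) = exp(u) - u - 1, as in Lemma 12."""
    return math.exp(u) - u - 1.0


def lemma12_sides(mu, f, t, c):
    """Return (D(mu^(tf) || mu), E^(tf)[phi(-t (f - c))]) on a finite space."""
    weights = [m * math.exp(t * fx) for m, fx in zip(mu, f)]
    z = sum(weights)
    tilted = [w / z for w in weights]
    divergence = sum(p * math.log(p / m) for p, m in zip(tilted, mu))
    rhs = sum(p * phi(-t * (fx - c)) for p, fx in zip(tilted, f))
    return divergence, rhs


rng = random.Random(4)
worst_slack = float("inf")
for _ in range(2000):
    size = rng.randint(2, 6)
    raw = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(raw)
    mu = [w / total for w in raw]
    f = [rng.uniform(-2.0, 2.0) for _ in range(size)]
    t = rng.uniform(-2.0, 2.0)
    c = rng.uniform(-3.0, 3.0)
    divergence, rhs = lemma12_sides(mu, f, t, c)
    worst_slack = min(worst_slack, rhs - divergence)
print(worst_slack)
```

The free parameter $c$ is what makes the lemma useful later: in the proof of Theorem 29 it is chosen as the value of $f$ at a resampled coordinate.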

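Notice that, while (3.141) is an upper bound on $D(\mu^{(tf)} \| \mu)$, the thermal fluctuation representation
(3.109) of Lemma 9 is an exact expression. Lemma 12 leads to the following inequality of log-Sobolev
type:

Theorem 29. Let $X_1, \ldots, X_n$ be $n$ independent random variables taking values in a set $\mathcal{X}$, and let
$U \triangleq f(X^n)$ for a function $f : \mathcal{X}^n \to \mathbb{R}$. Let $P = P_{X^n} = P_{X_1} \otimes \cdots \otimes P_{X_n}$ be the product probability
distribution of $X^n$. Also, let $X'_1, \ldots, X'_n$ be independent copies of the $X_i$'s, and define for each $i \in \{1, \ldots, n\}$

$$U^{(i)} \triangleq f(X_1, \ldots, X_{i-1}, X'_i, X_{i+1}, \ldots, X_n).$$

Then,

$$D\big(P^{(tf)} \,\big\|\, P\big) \le \exp\big(-\Lambda(t)\big) \sum_{i=1}^n \mathbb{E}\Big[\exp(tU)\, \phi\big(-t(U - U^{(i)})\big)\Big], \tag{3.142}$$

where $\phi(u) \triangleq \exp(u) - u - 1$ for $u \in \mathbb{R}$, $\Lambda(t) \triangleq \ln \mathbb{E}[\exp(tU)]$ is the logarithmic moment-generating
function, and the expectation on the right-hand side is w.r.t. $X^n$ and $(X')^n$. Moreover, if we define the
function $\tau : \mathbb{R} \to \mathbb{R}$ by $\tau(u) = u(\exp(u) - 1)$ for $u \in \mathbb{R}$, then

$$D\big(P^{(tf)} \,\big\|\, P\big) \le \exp\big(-\Lambda(t)\big) \sum_{i=1}^n \mathbb{E}\Big[\exp(tU)\, \tau\big(-t(U - U^{(i)})\big)\, 1_{\{U > U^{(i)}\}}\Big] \tag{3.143}$$

and

$$D\big(P^{(tf)} \,\big\|\, P\big) \le \exp\big(-\Lambda(t)\big) \sum_{i=1}^n \mathbb{E}\Big[\exp(tU)\, \tau\big(-t(U - U^{(i)})\big)\, 1_{\{U < U^{(i)}\}}\Big]. \tag{3.144}$$

Proof. Applying Proposition 5 to $P$ and $Q = P^{(tf)}$, we have

$$\begin{aligned}
D\big(P^{(tf)} \,\big\|\, P\big) &\le \sum_{i=1}^n D\big(P^{(tf)}_{X_i | \bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, P^{(tf)}_{\bar{X}^i}\big) \\
&= \sum_{i=1}^n \int P^{(tf)}_{\bar{X}^i}(d\bar{x}^i)\, D\big(P^{(tf)}_{X_i | \bar{X}^i = \bar{x}^i} \,\big\|\, P_{X_i}\big) \\
&= \sum_{i=1}^n \int P^{(tf)}_{\bar{X}^i}(d\bar{x}^i)\, D\big(P^{(t f_i(\cdot | \bar{x}^i))}_{X_i} \,\big\|\, P_{X_i}\big).
\end{aligned} \tag{3.145}$$

Fix some $i \in \{1, \ldots, n\}$, $\bar{x}^i \in \mathcal{X}^{n-1}$, and $x'_i \in \mathcal{X}$. Let us apply Lemma 12 to the $i$th term of the summation
in (3.145) with $\mu = P_{X_i}$, $f = f_i(\cdot | \bar{x}^i)$, and $c = f(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_n) = f_i(x'_i | \bar{x}^i)$ to get

$$D\big(P^{(t f_i(\cdot | \bar{x}^i))}_{X_i} \,\big\|\, P_{X_i}\big) \le \mathbb{E}^{(t f_i(\cdot | \bar{x}^i))}_{P_{X_i}}\Big[\phi\Big(-t\big(f_i(X_i | \bar{x}^i) - f_i(x'_i | \bar{x}^i)\big)\Big)\Big].$$

Substituting this into (3.145), and then taking expectation w.r.t. both $X^n$ and $(X')^n$, we get (3.142).
To prove (3.143) and (3.144), let us write (note that $\phi(0) = 0$)

$$\exp(tU)\, \phi\big(-t(U - U^{(i)})\big) = \exp(tU)\, \phi\big(-t(U - U^{(i)})\big)\, 1_{\{U > U^{(i)}\}} + \exp(tU)\, \phi\big(t(U^{(i)} - U)\big)\, 1_{\{U < U^{(i)}\}}. \tag{3.146}$$

Since $X_i$ and $X'_i$ are i.i.d. and independent of $\bar{X}^i$, we have (due to a symmetry consideration which follows
from the identical distribution of $U$ and $U^{(i)}$)

$$\begin{aligned}
\mathbb{E}\Big[\exp(tU)\, \phi\big(t(U^{(i)} - U)\big)\, 1_{\{U < U^{(i)}\}} \,\Big|\, \bar{X}^i\Big]
&= \mathbb{E}\Big[\exp(tU^{(i)})\, \phi\big(t(U - U^{(i)})\big)\, 1_{\{U > U^{(i)}\}} \,\Big|\, \bar{X}^i\Big] \\
&= \mathbb{E}\Big[\exp(tU) \exp\big(t(U^{(i)} - U)\big)\, \phi\big(t(U - U^{(i)})\big)\, 1_{\{U > U^{(i)}\}} \,\Big|\, \bar{X}^i\Big].
\end{aligned}$$

Using this and (3.146), we can write

$$\mathbb{E}\Big[\exp(tU)\, \phi\big(-t(U - U^{(i)})\big)\Big] = \mathbb{E}\Big\{\exp(tU) \Big[\phi\big(t(U^{(i)} - U)\big) + \exp\big(t(U^{(i)} - U)\big)\, \phi\big(t(U - U^{(i)})\big)\Big]\, 1_{\{U > U^{(i)}\}}\Big\}.$$

Using the equality $\phi(u) + \exp(u) \phi(-u) = \tau(u)$ for every $u \in \mathbb{R}$, we get (3.143). The proof of (3.144) is
similar. □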



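Now suppose that $f$ satisfies the bounded difference condition in (3.138). Using this together with
the fact that $\tau(-u) = u(1 - \exp(-u)) \le u^2$ for every $u > 0$, then for every $t > 0$ we can write

$$\begin{aligned}
D\big(P^{(tf)} \,\big\|\, P\big) &\le \exp\big(-\Lambda(t)\big) \sum_{i=1}^n \mathbb{E}\Big[\exp(tU)\, \tau\big(-t(U - U^{(i)})\big)\, 1_{\{U > U^{(i)}\}}\Big] \\
&\le t^2 \exp\big(-\Lambda(t)\big) \sum_{i=1}^n \mathbb{E}\Big[\exp(tU)\, \big(U - U^{(i)}\big)^2\, 1_{\{U > U^{(i)}\}}\Big] \\
&\le t^2 \exp\big(-\Lambda(t)\big) \sum_{i=1}^n c_i^2\, \mathbb{E}\Big[\exp(tU)\, 1_{\{U > U^{(i)}\}}\Big] \\
&\le t^2 \exp\big(-\Lambda(t)\big) \left(\sum_{i=1}^n c_i^2\right) \mathbb{E}\big[\exp(tU)\big] \\
&= \left(\sum_{i=1}^n c_i^2\right) t^2.
\end{aligned}$$

Applying Corollary 7, we get

$$\mathbb{P}\big(f(X^n) \ge \mathbb{E} f(X^n) + r\big) \le \exp\left(-\frac{r^2}{4 \sum_{i=1}^n c_i^2}\right), \qquad \forall r \ge 0$$

which is weaker than McDiarmid's inequality (3.139) since the constant in the exponent here is smaller
by a factor of 8.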

3.3.5 Log-Sobolev inequalities for Poisson and compound Poisson measures 



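Let $P_\lambda$ denote, for some $\lambda > 0$, the Poisson($\lambda$) measure, i.e., $P_\lambda(n) \triangleq \frac{e^{-\lambda} \lambda^n}{n!}$ for every $n \in \mathbb{N}_0$, where
$\mathbb{N}_0 \triangleq \mathbb{N} \cup \{0\}$ is the set of the non-negative integers. Bobkov and Ledoux [54] have established the
following log-Sobolev inequality (this result is obtained by combining the log-Sobolev inequality in
[54, Corollary 7] with equality (3.98)): for any function $f : \mathbb{N}_0 \to \mathbb{R}$,

$$D\big(P^{(f)}_\lambda \,\big\|\, P_\lambda\big) \le \lambda\, \mathbb{E}^{(f)}_{P_\lambda}\Big[(\Gamma f)\, e^{\Gamma f} - e^{\Gamma f} + 1\Big] \tag{3.147}$$

where $\Gamma$ is the modulus of the discrete gradient:

$$\Gamma f(x) \triangleq |f(x) - f(x + 1)|, \qquad \forall x \in \mathbb{N}_0. \tag{3.148}$$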
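
A numerical illustration of (3.147) is possible if one truncates the Poisson support. The sketch below makes that assumption explicitly: sums are cut off at `kmax = 150`, which for the moderate values of $\lambda$ and the bounded test functions used here leaves a negligible tail. The test functions and the names `poisson_pmf` and `poisson_lsi_sides` are illustrative choices.

```python
import math


def poisson_pmf(lam, kmax):
    """Poisson(lam) probabilities for 0..kmax, computed by the ratio recursion."""
    probs = [math.exp(-lam)]
    for k in range(kmax):
        probs.append(probs[-1] * lam / (k + 1))
    return probs


def poisson_lsi_sides(f, lam, kmax=150):
    """Return (D(P^(f) || P), right-hand side of (3.147)), truncated at kmax.

    Assumes f is bounded so that the neglected Poisson tail is negligible.
    """
    base = poisson_pmf(lam, kmax)
    values = [f(x) for x in range(kmax + 2)]   # one extra point for f(x + 1)
    z = sum(p * math.exp(v) for p, v in zip(base, values))
    tilted = [p * math.exp(v) / z for p, v in zip(base, values)]
    # ln(dP^(f)/dP) = f - ln z, hence D = E^(f)[f] - ln z.
    divergence = sum(q * (v - math.log(z)) for q, v in zip(tilted, values))

    def h(u):
        return u * math.exp(u) - math.exp(u) + 1.0

    rhs = lam * sum(q * h(abs(values[x] - values[x + 1]))
                    for x, q in enumerate(tilted))
    return divergence, rhs


test_functions = [
    lambda x: math.sin(x),
    lambda x: 0.5 * min(x, 5),
    lambda x: 1.0 if x % 2 == 0 else -1.0,
]
results = [poisson_lsi_sides(f, lam)
           for f in test_functions for lam in (0.5, 2.0, 5.0)]
for divergence, rhs in results:
    print(divergence, rhs)
```

For small perturbations $f = \varepsilon g$, both sides are of order $\varepsilon^2$, and the comparison reduces to the Poisson Poincaré inequality $\mathrm{var}[g] \le \lambda\, \mathbb{E}[(\Gamma g)^2]$.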



Using tensorization of (3.147), Kontoyiannis and Madiman [142] gave a simple proof of a log-Sobolev
inequality for the compound Poisson distribution. We recall that a compound Poisson distribution is
defined as follows: given $\lambda > 0$ and a probability measure $\mu$ on $\mathbb{N}$, the compound Poisson distribution
$\mathrm{CP}_{\lambda,\mu}$ is the distribution of the random sum

$$Z = \sum_{i=1}^N X_i, \tag{3.149}$$

where $N \sim P_\lambda$ and $X_1, X_2, \ldots$ are i.i.d. random variables with distribution $\mu$, independent of $N$ (if $N$
takes the value zero, then $Z$ is defined to be zero).

Theorem 30 (Log-Sobolev inequality for compound Poisson measures [142]). For an arbitrary probability
measure $\mu$ on $\mathbb{N}$ and an arbitrary bounded function $f : \mathbb{N}_0 \to \mathbb{R}$, and for every $\lambda > 0$,

$$D\big(\mathrm{CP}^{(f)}_{\lambda,\mu} \,\big\|\, \mathrm{CP}_{\lambda,\mu}\big) \le \lambda \sum_{k=1}^\infty \mu(k)\, \mathbb{E}^{(f)}_{\mathrm{CP}_{\lambda,\mu}}\Big[(\Gamma_k f)\, e^{\Gamma_k f} - e^{\Gamma_k f} + 1\Big] \tag{3.150}$$

where $\Gamma_k f(x) \triangleq |f(x) - f(x + k)|$ for each $k, x \in \mathbb{N}_0$.

Proof. The proof relies on the following alternative representation of the $\mathrm{CP}_{\lambda,\mu}$ probability measure:

Lemma 13. If $Z \sim \mathrm{CP}_{\lambda,\mu}$, then

$$Z \stackrel{d}{=} \sum_{k=1}^\infty k Y_k, \qquad Y_k \sim P_{\lambda \mu(k)}, \quad \forall k \in \mathbb{N} \tag{3.151}$$

where $\{Y_k\}_{k=1}^\infty$ are independent random variables, and $\stackrel{d}{=}$ means equality in distribution.

Proof. The characteristic function of $Z$ in (3.151) is equal to

$$\varphi_Z(\nu) \triangleq \mathbb{E}[\exp(j \nu Z)] = \exp\left\{\lambda \left(\sum_{k=1}^\infty \mu(k) \exp(j \nu k) - 1\right)\right\}, \qquad \forall \nu \in \mathbb{R}$$

which coincides with the characteristic function of $Z \sim \mathrm{CP}_{\lambda,\mu}$ in (3.149). The proof of Lemma 13 follows
from the fact that two random variables are equal in distribution if and only if their characteristic
functions coincide. □
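
The representation (3.151) can also be checked directly at the level of probability mass functions, at least when $\mu$ has finite support. In the sketch below, the measure `mu = {1: 0.5, 2: 0.3, 3: 0.2}`, the rate `lam`, the cutoff `m` and all function names are assumptions made for illustration. Because every summand is non-negative, the probabilities of the values $0, \ldots, m$ are computed exactly (up to rounding) by truncated convolutions.

```python
import math


def poisson_pmf(lam, kmax):
    """Poisson(lam) probabilities for 0..kmax."""
    probs = [math.exp(-lam)]
    for k in range(kmax):
        probs.append(probs[-1] * lam / (k + 1))
    return probs


def convolve(a, b, m):
    """Convolution of two pmfs on {0, 1, ...}, keeping only the values 0..m."""
    out = [0.0] * (m + 1)
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            if i + j > m:
                break
            out[i + j] += ai * bj
    return out


def compound_poisson_pmf(lam, mu, m):
    """pmf of Z = X_1 + ... + X_N on 0..m, as in (3.149); mu maps k >= 1 to mu(k)."""
    step = [0.0] * (m + 1)
    for k, w in mu.items():
        if k <= m:
            step[k] = w
    count_pmf = poisson_pmf(lam, m)
    conv = [1.0] + [0.0] * m            # mu^{*0} is a point mass at 0
    out = [0.0] * (m + 1)
    for n in range(m + 1):              # X_i >= 1, so N > m cannot give Z <= m
        for z in range(m + 1):
            out[z] += count_pmf[n] * conv[z]
        conv = convolve(conv, step, m)
    return out


def weighted_poisson_sum_pmf(lam, mu, m):
    """pmf of sum_k k * Y_k on 0..m with independent Y_k ~ Poisson(lam * mu(k))."""
    out = [1.0] + [0.0] * m
    for k, w in mu.items():
        scaled = [0.0] * (m + 1)
        for y, prob in enumerate(poisson_pmf(lam * w, m // k)):
            scaled[k * y] = prob        # k * Y_k lives on multiples of k
        out = convolve(out, scaled, m)
    return out


lam, m = 1.7, 40
mu = {1: 0.5, 2: 0.3, 3: 0.2}
pmf_sum = compound_poisson_pmf(lam, mu, m)
pmf_rep = weighted_poisson_sum_pmf(lam, mu, m)
max_diff = max(abs(a - b) for a, b in zip(pmf_sum, pmf_rep))
print(max_diff)
```

The value of `max_diff` should be at the level of floating-point rounding, in agreement with Lemma 13.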

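For each $n \in \mathbb{N}$, let $P_n$ denote the product distribution of $Y_1, \ldots, Y_n$. Consider an arbitrary function
$f : \mathbb{N}_0 \to \mathbb{R}$, and define the function $g : (\mathbb{N}_0)^n \to \mathbb{R}$ by

$$g(y_1, \ldots, y_n) \triangleq f\left(\sum_{k=1}^n k y_k\right), \qquad \forall y_1, \ldots, y_n \in \mathbb{N}_0.$$

If we now denote by $\bar{P}_n$ the distribution of the sum $S_n = \sum_{k=1}^n k Y_k$, then

$$\begin{aligned}
D\big(\bar{P}^{(f)}_n \,\big\|\, \bar{P}_n\big)
&= \mathbb{E}_{\bar{P}_n}\left[\frac{\exp(f(S_n))}{\mathbb{E}_{\bar{P}_n}[\exp(f(S_n))]} \ln \left(\frac{\exp(f(S_n))}{\mathbb{E}_{\bar{P}_n}[\exp(f(S_n))]}\right)\right] \\
&= \mathbb{E}_{P_n}\left[\frac{\exp(g(Y^n))}{\mathbb{E}_{P_n}[\exp(g(Y^n))]} \ln \left(\frac{\exp(g(Y^n))}{\mathbb{E}_{P_n}[\exp(g(Y^n))]}\right)\right] \\
&= D\big(P^{(g)}_n \,\big\|\, P_n\big) \\
&\le \sum_{k=1}^n D\big(P^{(g)}_{Y_k | \bar{Y}^k} \,\big\|\, P_{Y_k} \,\big|\, P^{(g)}_{\bar{Y}^k}\big)
\end{aligned} \tag{3.152}$$

where the last line uses Proposition 5 and the fact that $P_n$ is a product distribution. Using the fact that

$$\frac{dP^{(g)}_{Y_k | \bar{Y}^k = \bar{y}^k}}{dP_{Y_k}} = \frac{\exp\big(g_k(\cdot | \bar{y}^k)\big)}{\mathbb{E}_{P_{Y_k}}\big[\exp\big(g_k(Y_k | \bar{y}^k)\big)\big]}$$

and applying the Bobkov-Ledoux inequality in (3.147) to $P_{Y_k}$ and all functions of the form $g_k(\cdot | \bar{y}^k)$, we
can write

$$D\big(P^{(g)}_{Y_k | \bar{Y}^k} \,\big\|\, P_{Y_k} \,\big|\, P^{(g)}_{\bar{Y}^k}\big) \le \lambda \mu(k)\, \mathbb{E}^{(g)}_{P_n}\Big[\big(\Gamma g_k(Y_k | \bar{Y}^k)\big)\, e^{\Gamma g_k(Y_k | \bar{Y}^k)} - e^{\Gamma g_k(Y_k | \bar{Y}^k)} + 1\Big] \tag{3.153}$$

where $\Gamma$ is the absolute value of the "one-dimensional" discrete gradient in (3.148). For any $y^n \in (\mathbb{N}_0)^n$,
we have

$$\begin{aligned}
\Gamma g_k(y_k | \bar{y}^k) &= \big|g_k(y_k | \bar{y}^k) - g_k(y_k + 1 | \bar{y}^k)\big| \\
&= \left| f\left(k y_k + \sum_{j \in \{1, \ldots, n\} \setminus \{k\}} j y_j\right) - f\left(k (y_k + 1) + \sum_{j \in \{1, \ldots, n\} \setminus \{k\}} j y_j\right) \right| \\
&= \left| f\left(\sum_{j=1}^n j y_j\right) - f\left(\sum_{j=1}^n j y_j + k\right) \right| \\
&= \Gamma_k f\left(\sum_{j=1}^n j y_j\right).
\end{aligned}$$

Using this in (3.153) and performing the reverse change of measure from $P_n$ to $\bar{P}_n$, we can write

$$D\big(P^{(g)}_{Y_k | \bar{Y}^k} \,\big\|\, P_{Y_k} \,\big|\, P^{(g)}_{\bar{Y}^k}\big) \le \lambda \mu(k)\, \mathbb{E}^{(f)}_{\bar{P}_n}\Big[\big(\Gamma_k f(S_n)\big)\, e^{\Gamma_k f(S_n)} - e^{\Gamma_k f(S_n)} + 1\Big]. \tag{3.154}$$

Therefore, the combination of (3.152) and (3.154) gives

$$\begin{aligned}
D\big(\bar{P}^{(f)}_n \,\big\|\, \bar{P}_n\big) &\le \lambda \sum_{k=1}^n \mu(k)\, \mathbb{E}^{(f)}_{\bar{P}_n}\Big[(\Gamma_k f)\, e^{\Gamma_k f} - e^{\Gamma_k f} + 1\Big] \\
&\le \lambda \sum_{k=1}^\infty \mu(k)\, \mathbb{E}^{(f)}_{\bar{P}_n}\Big[(\Gamma_k f)\, e^{\Gamma_k f} - e^{\Gamma_k f} + 1\Big]
\end{aligned} \tag{3.155}$$

where the second line follows from the inequality $x e^x - e^x + 1 \ge 0$ that holds for all $x \ge 0$.

Now we will take the limit as $n \to \infty$ of both sides of (3.155). For the left-hand side, we use the fact
that, by (3.151), $\bar{P}_n$ converges in distribution to $\mathrm{CP}_{\lambda,\mu}$ as $n \to \infty$. Since $f$ is bounded,
$\bar{P}^{(f)}_n \to \mathrm{CP}^{(f)}_{\lambda,\mu}$ in distribution. Therefore, by the bounded convergence theorem (whose use can be
justified by the fact that $f$ is bounded), we have

$$\lim_{n \to \infty} D\big(\bar{P}^{(f)}_n \,\big\|\, \bar{P}_n\big) = D\big(\mathrm{CP}^{(f)}_{\lambda,\mu} \,\big\|\, \mathrm{CP}_{\lambda,\mu}\big). \tag{3.156}$$

For the right-hand side, we have

$$\begin{aligned}
\sum_{k=1}^\infty \mu(k)\, \mathbb{E}^{(f)}_{\bar{P}_n}\Big[(\Gamma_k f)\, e^{\Gamma_k f} - e^{\Gamma_k f} + 1\Big]
&= \mathbb{E}^{(f)}_{\bar{P}_n}\left\{\sum_{k=1}^\infty \mu(k) \Big[(\Gamma_k f)\, e^{\Gamma_k f} - e^{\Gamma_k f} + 1\Big]\right\} \\
&\xrightarrow{n \to \infty} \mathbb{E}^{(f)}_{\mathrm{CP}_{\lambda,\mu}}\left\{\sum_{k=1}^\infty \mu(k) \Big[(\Gamma_k f)\, e^{\Gamma_k f} - e^{\Gamma_k f} + 1\Big]\right\} \\
&= \sum_{k=1}^\infty \mu(k)\, \mathbb{E}^{(f)}_{\mathrm{CP}_{\lambda,\mu}}\Big[(\Gamma_k f)\, e^{\Gamma_k f} - e^{\Gamma_k f} + 1\Big]
\end{aligned} \tag{3.157}$$

where the first and the last steps follow from Fubini's theorem, and the second step follows from the
bounded convergence theorem. Putting (3.155)-(3.157) together, we get the inequality in (3.150). This
completes the proof of Theorem 30. □

3.3.6 Bounds on the variance: Efron-Stein-Steele and Poincaré inequalities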

As we have seen, tight bounds on the variance of a function $f(X^n)$ of independent random variables
$X_1, \ldots, X_n$ are key to obtaining tight bounds on the deviation probabilities $\mathbb{P}\big(f(X^n) \ge \mathbb{E} f(X^n) + r\big)$
for $r > 0$. It turns out that the reverse is also true: assuming that $f$ has Gaussian-like concentration
behavior,

$$\mathbb{P}\big(f(X^n) \ge \mathbb{E} f(X^n) + r\big) \le K \exp\big(-\kappa r^2\big), \qquad \forall r \ge 0$$

it is possible to derive tight bounds on the variance of $f(X^n)$.

We start by deriving a version of a well-known inequality due to Efron and Stein [143], with subsequent
refinements by Steele [144].

In the following, we say that a function $f$ is "sufficiently regular" if the functions $tf$ are exponentially
integrable for all sufficiently small $t > 0$.

Theorem 31. Let $X_1, \ldots, X_n$ be independent $\mathcal{X}$-valued random variables. Then, for every sufficiently
regular $f : \mathcal{X}^n \to \mathbb{R}$,

$$\mathrm{var}[f(X^n)] \le \sum_{i=1}^n \mathbb{E}\big\{\mathrm{var}\big[f(X^n) \,\big|\, \bar{X}^i\big]\big\}. \tag{3.158}$$

Proof. By Proposition 5, for any $t > 0$, we have

$$D\big(P^{(tf)} \,\big\|\, P\big) \le \sum_{i=1}^n D\big(P^{(tf)}_{X_i | \bar{X}^i} \,\big\|\, P_{X_i} \,\big|\, P^{(tf)}_{\bar{X}^i}\big).$$

Using Lemma 9, we can rewrite this inequality as

$$\int_0^t \int_s^t \mathrm{var}^{(\tau f)}[f] \, d\tau \, ds \le \sum_{i=1}^n \mathbb{E}_{P^{(tf)}_{\bar{X}^i}}\left[\int_0^t \int_s^t \mathrm{var}^{(\tau f_i(\cdot | \bar{X}^i))}_{P_{X_i}}\big[f_i(X_i | \bar{X}^i)\big] \, d\tau \, ds\right].$$

Dividing both sides by $t^2$, passing to the limit of $t \to 0$, and using the fact that

$$\lim_{t \to 0} \frac{1}{t^2} \int_0^t \int_s^t d\tau \, ds = \frac{1}{2},$$

we get (3.158). □
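
The Efron-Stein-Steele inequality (3.158) can be verified exactly for functions of a few independent bits. The sketch below assumes independent, not necessarily identically distributed, Bernoulli($p_i$) coordinates and random function values; `efron_stein_sides` and the sampling ranges are illustrative choices. For binary coordinates the conditional variance given $\bar{X}^i$ is that of a two-point random variable, namely $p_i(1 - p_i)$ times the squared difference of the two values of $f$.

```python
import itertools
import random


def efron_stein_sides(f, ps):
    """Return (var[f(X^n)], sum_i E{var[f(X^n) | all coordinates except i]}).

    f is a dict on {0,1}^n and ps[i] = P(X_i = 1) for independent bits.
    """
    n = len(ps)
    points = list(itertools.product((0, 1), repeat=n))

    def prob(x):
        out = 1.0
        for bit, p in zip(x, ps):
            out *= p if bit else 1.0 - p
        return out

    mean = sum(prob(x) * f[x] for x in points)
    variance = sum(prob(x) * (f[x] - mean) ** 2 for x in points)

    bound = 0.0
    for i, p in enumerate(ps):
        for x in points:
            if x[i] == 1:
                continue  # visit each configuration of the other coordinates once
            x_one = x[:i] + (1,) + x[i + 1:]
            weight = prob(x) / (1.0 - p)   # probability of the other n-1 coordinates
            # Variance of a two-point variable equal to f[x_one] w.p. p, f[x] w.p. 1-p.
            bound += weight * p * (1.0 - p) * (f[x_one] - f[x]) ** 2
    return variance, bound


rng = random.Random(5)
worst_slack = float("inf")
for _ in range(500):
    n = rng.randint(1, 4)
    ps = [rng.uniform(0.1, 0.9) for _ in range(n)]
    f = {x: rng.uniform(-2.0, 2.0) for x in itertools.product((0, 1), repeat=n)}
    variance, bound = efron_stein_sides(f, ps)
    worst_slack = min(worst_slack, bound - variance)
print(worst_slack)
```

For $n = 1$, and more generally for additive functions $f(x^n) = \sum_i h_i(x_i)$, the two sides coincide, so (3.158) cannot be improved in general.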

Next, we discuss the connection between log-Sobolev inequalities and another class of functional inequalities: the Poincaré inequalities. Consider, as before, a probability space $(\Omega, \mathcal{F}, \mu)$ and a pair $(\mathcal{A}, \Gamma)$ satisfying the conditions (LSI-1)-(LSI-3). Then, we say that $\mu$ satisfies a Poincaré inequality with constant $c \ge 0$ if

$$\mathrm{var}_\mu[f] \le c \, \mathbb{E}_\mu\big[(\Gamma f)^2\big], \qquad \forall f \in \mathcal{A}. \tag{3.159}$$

Theorem 32. Suppose that $\mu$ satisfies LSI($c$) w.r.t. $(\mathcal{A}, \Gamma)$. Then $\mu$ also satisfies a Poincaré inequality with constant $c$.

Proof. For any $f \in \mathcal{A}$ and any $t > 0$, we can use Lemma 9 to express the corresponding LSI($c$) for the function $tf$ as

$$\int_0^t \int_s^t \mathrm{var}^{(\tau f)}_\mu[f] \, d\tau \, ds \le \frac{c t^2}{2} \cdot \mathbb{E}^{(tf)}_\mu\big[(\Gamma f)^2\big]. \tag{3.160}$$

Proceeding exactly as in the proof of Theorem 31 above (i.e., by dividing both sides of the above inequality by $t^2$ and taking the limit where $t \to 0$), we obtain

$$\frac{1}{2} \mathrm{var}_\mu[f] \le \frac{c}{2} \, \mathbb{E}_\mu\big[(\Gamma f)^2\big].$$

Multiplying both sides by 2, we see that $\mu$ indeed satisfies (3.159). □

Moreover, Poincaré inequalities tensorize, as the following analogue of Theorem 24 shows:

Theorem 33. Let $X_1, \ldots, X_n \in \mathcal{X}$ be $n$ independent random variables, and let $P = P_{X_1} \otimes \cdots \otimes P_{X_n}$ be their joint distribution. Let $\mathcal{A}$ consist of all functions $f : \mathcal{X}^n \to \mathbb{R}$ such that, for every $i$,

$$f_i(\cdot \mid \bar{x}^i) \in \mathcal{A}_i, \qquad \forall \bar{x}^i \in \mathcal{X}^{n-1}. \tag{3.161}$$

Define the operator $\Gamma$ that maps each $f \in \mathcal{A}$ to

$$\Gamma f = \sqrt{\sum_{i=1}^n (\Gamma_i f)^2}, \tag{3.162}$$

which is shorthand for

$$\Gamma f(x^n) = \sqrt{\sum_{i=1}^n \big(\Gamma_i f_i(x_i \mid \bar{x}^i)\big)^2}, \qquad \forall x^n \in \mathcal{X}^n. \tag{3.163}$$

Suppose that, for every $i \in \{1, \ldots, n\}$, $P_{X_i}$ satisfies a Poincaré inequality with constant $c$ with respect to $(\mathcal{A}_i, \Gamma_i)$. Then $P$ satisfies a Poincaré inequality with constant $c$ with respect to $(\mathcal{A}, \Gamma)$.

Proof. The proof is conceptually similar to the proof of Theorem 24 (which refers to the tensorization of the logarithmic Sobolev inequality), except that now we use the Efron-Stein-Steele inequality of Theorem 31 to tensorize the variance of $f$. □


3.4 Transportation-cost inequalities

So far, we discussed concentration of measure through the lens of various functional inequalities, primarily log-Sobolev inequalities. In a nutshell, if we are interested in the concentration properties of a given function $f(X^n)$ of a random $n$-tuple $X^n \in \mathcal{X}^n$, we seek to control the divergence $D(P^{(f)} \| P)$, where $P$ is the distribution of $X^n$ and $P^{(f)}$ is its $f$-tilting, $dP^{(f)}/dP \propto \exp(f)$, by some quantity related to the sensitivity of $f$ to modifications of its arguments (e.g., the squared norm of the gradient of $f$, as in the Gaussian log-Sobolev inequality of Gross [H]). The common theme underlying these functional inequalities is that any such measure of sensitivity is tied to a particular metric structure on the underlying product space $\mathcal{X}^n$. To see this, suppose that $\mathcal{X}^n$ is equipped with some metric $d(\cdot, \cdot)$, and consider the following generalized definition of the modulus of the gradient of any function $f : \mathcal{X}^n \to \mathbb{R}$:

$$|\nabla f|(x^n) \triangleq \limsup_{y^n : \, d(x^n, y^n) \downarrow 0} \frac{|f(x^n) - f(y^n)|}{d(x^n, y^n)}. \tag{3.164}$$

If we also define the Lipschitz constant of $f$ by

$$\|f\|_{\mathrm{Lip}} \triangleq \sup_{x^n \ne y^n} \frac{|f(x^n) - f(y^n)|}{d(x^n, y^n)} \tag{3.165}$$

and consider the class $\mathcal{A}$ of all functions $f$ with $\|f\|_{\mathrm{Lip}} < \infty$, then it is easy to see that the pair $(\mathcal{A}, \Gamma)$ with $\Gamma f(x^n) = |\nabla f|(x^n)$ satisfies the conditions (LSI-1)-(LSI-3) listed in Section 3.3. Consequently, suppose that a given probability distribution $P$ for a random $n$-tuple $X^n \in \mathcal{X}^n$ satisfies LSI($c$) w.r.t. the pair $(\mathcal{A}, \Gamma)$. The use of (3.103) and the inequality $\|\Gamma f\|_\infty \le \|f\|_{\mathrm{Lip}}$, which follows directly from (3.164) and (3.165), gives the concentration inequality

$$\mathbb{P}\big(f(X^n) \ge \mathbb{E}f(X^n) + r\big) \le \exp\left(-\frac{r^2}{2c \|f\|^2_{\mathrm{Lip}}}\right), \qquad \forall r \ge 0. \tag{3.166}$$

All the examples of concentration we have discussed so far in this chapter can be seen to fit this theme. Consider, for instance, the following cases:

1. Euclidean metric: for $\mathcal{X} = \mathbb{R}$, equip the product space $\mathcal{X}^n = \mathbb{R}^n$ with the ordinary Euclidean metric:

$$d(x^n, y^n) = \|x^n - y^n\| = \sqrt{\sum_{i=1}^n |x_i - y_i|^2}.$$

Then, the Lipschitz constant $\|f\|_{\mathrm{Lip}}$ of any function $f : \mathcal{X}^n \to \mathbb{R}$ is given by

$$\|f\|_{\mathrm{Lip}} = \sup_{x^n \ne y^n} \frac{|f(x^n) - f(y^n)|}{d(x^n, y^n)} = \sup_{x^n \ne y^n} \frac{|f(x^n) - f(y^n)|}{\|x^n - y^n\|}, \tag{3.167}$$

and for any probability measure $P$ on $\mathbb{R}^n$ that satisfies LSI($c$) we have the bound (3.166). We have already seen in (3.44) a particular instance of this with $P = G^n$, which satisfies LSI(1).

2. Weighted Hamming metric: for any $n$ constants $c_1, \ldots, c_n > 0$ and any measurable space $\mathcal{X}$, let us equip the product space $\mathcal{X}^n$ with the metric

$$d_{c^n}(x^n, y^n) \triangleq \sum_{i=1}^n c_i 1_{\{x_i \ne y_i\}}.$$

The corresponding Lipschitz constant $\|f\|_{\mathrm{Lip}}$, which we also denote by $\|f\|_{\mathrm{Lip}, c^n}$ to emphasize the role of the weights $\{c_i\}_{i=1}^n$, is given by

$$\|f\|_{\mathrm{Lip}, c^n} \triangleq \sup_{x^n \ne y^n} \frac{|f(x^n) - f(y^n)|}{d_{c^n}(x^n, y^n)}.$$

Then it is easy to see that the condition $\|f\|_{\mathrm{Lip}, c^n} \le 1$ is equivalent to (3.138). As we have shown in Section 3.3.4, any product probability measure $P$ on $\mathcal{X}^n$ equipped with the metric $d_{c^n}$ satisfies LSI(1/4) w.r.t.

$$\mathcal{A} = \big\{ f : \|f\|_{\mathrm{Lip}, c^n} < \infty \big\}$$

and $\Gamma f(\cdot) = |\nabla f|(\cdot)$ with $|\nabla f|$ given by (3.164) with $d = d_{c^n}$. In this case, the concentration inequality (3.166) (with $c = 1/4$) is precisely McDiarmid's inequality (3.139).

The above two examples suggest that the metric structure plays the primary role, while the functional concentration inequalities like (3.166) are simply a consequence. In this section, we describe an alternative approach to concentration that works directly on the level of probability measures, rather than functions, and that makes this intuition precise. The key tool underlying this approach is the notion of transportation cost, which can be used to define a metric on probability distributions over the space of interest in terms of a given base metric on this space. This metric on distributions is then related to the divergence via so-called transportation-cost inequalities. The pioneering work by K. Marton in [58] and [70] has shown that one can use these inequalities to deduce concentration.

3.4.1 Concentration and isoperimetry

We start by giving rigorous meaning to the notion that the concentration of measure phenomenon is fundamentally geometric in nature. In order to talk about concentration, we need the notion of a metric probability space in the sense of M. Gromov [145]. Specifically, we say that a triple $(\mathcal{X}, d, \mu)$ is a metric probability space if $(\mathcal{X}, d)$ is a Polish space (i.e., a complete and separable metric space) and $\mu$ is a probability measure on the Borel sets of $(\mathcal{X}, d)$.

For any set $A \subseteq \mathcal{X}$ and any $r > 0$, define the $r$-blowup of $A$ by

$$A_r \triangleq \{ x \in \mathcal{X} : d(x, A) < r \}, \tag{3.168}$$

where $d(x, A) \triangleq \inf_{y \in A} d(x, y)$ is the distance from the point $x$ to the set $A$. We then say that the probability measure $\mu$ has normal (or Gaussian) concentration on $(\mathcal{X}, d)$ if there exist some constants $K, \kappa > 0$, such that

$$\mu(A) \ge 1/2 \implies \mu(A_r) \ge 1 - K e^{-\kappa r^2}, \qquad \forall r > 0. \tag{3.169}$$

Remark 33. Of the two constants $K$ and $\kappa$ in (3.169), it is $\kappa$ that is more important. For that reason, sometimes we will say that $\mu$ has normal concentration with constant $\kappa > 0$ to mean that (3.169) holds with that value of $\kappa$ and some $K > 0$.

Remark 34. The concentration condition (3.169) is often weakened to the following: there exists some $r_0 > 0$, such that

$$\mu(A) \ge 1/2 \implies \mu(A_r) \ge 1 - K e^{-\kappa (r - r_0)^2}, \qquad \forall r \ge r_0 \tag{3.170}$$

(see, for example, [60, Remark 22.23] or [6U Proposition 3.3]). It is not hard to pass from (3.170) to the stronger statement (3.169), possibly with degraded constants (i.e., larger $K$ and/or smaller $\kappa$). However, since we mainly care about sufficiently large values of $r$, (3.170) with sharper constants is preferable. In the sequel, therefore, whenever we talk about Gaussian concentration with constant $\kappa > 0$, we will normally refer to (3.170), unless stated otherwise.

Here are a few standard examples (see [2, Section 1.1]):

1. Standard Gaussian distribution: if $\mathcal{X} = \mathbb{R}^n$, $d(x, y) = \|x - y\|$ is the standard Euclidean metric, and $\mu = G^n$ is the standard Gaussian distribution, then for any Borel set $A \subseteq \mathbb{R}^n$ with $G^n(A) \ge 1/2$ we have

$$G^n(A_r) \ge \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{r} \exp\left(-\frac{t^2}{2}\right) dt \ge 1 - \frac{1}{2} \exp\left(-\frac{r^2}{2}\right), \qquad \forall r \ge 0, \tag{3.171}$$

i.e., (3.169) holds with $K = \frac{1}{2}$ and $\kappa = \frac{1}{2}$.

2. Uniform distribution on the unit sphere: if $\mathcal{X} = \mathbb{S}^n = \{ x \in \mathbb{R}^{n+1} : \|x\| = 1 \}$, $d$ is given by the geodesic distance on $\mathbb{S}^n$, and $\mu = \sigma^n$ (the uniform distribution on $\mathbb{S}^n$), then for any Borel set $A \subseteq \mathbb{S}^n$ with $\sigma^n(A) \ge 1/2$ we have

$$\sigma^n(A_r) \ge 1 - \exp\left(-\frac{(n-1) r^2}{2}\right), \qquad \forall r \ge 0. \tag{3.172}$$

In this instance, (3.169) holds with $K = 1$ and $\kappa = (n-1)/2$. Notice that $\kappa$ is actually increasing with the ambient dimension $n$.

3. Uniform distribution on the Hamming cube: if $\mathcal{X} = \{0, 1\}^n$, $d$ is the normalized Hamming metric

$$d(x, y) = \frac{1}{n} \sum_{i=1}^n 1_{\{x_i \ne y_i\}}$$

for all $x = (x_1, \ldots, x_n), y = (y_1, \ldots, y_n) \in \{0, 1\}^n$, and $\mu = B^n$ is the uniform distribution on $\{0, 1\}^n$ (which is equal to the product of $n$ copies of a Bernoulli(1/2) measure on $\{0, 1\}$), then for any $A \subseteq \{0, 1\}^n$ with $B^n(A) \ge 1/2$ we have

$$B^n(A_r) \ge 1 - \exp\left(-2 n r^2\right), \qquad \forall r \ge 0, \tag{3.173}$$

so (3.169) holds with $K = 1$ and $\kappa = 2n$.
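The Hamming-cube bound (3.173) can be verified exactly for one concrete set. The following sketch is an illustration under our own assumptions (an even $n$, the set $A = \{x : \text{weight}(x) \le n/2\}$, which has $B^n(A) \ge 1/2$, and the hypothetical helper names below). For this $A$, the complement of $A_r$ with $r = k/n$ is the set of strings of weight at least $n/2 + k$, so $1 - B^n(A_r)$ is a binomial tail that can be computed in closed form and compared with $\exp(-2nr^2)$.

```python
import math


def tail_beyond_half(n, k):
    """Exact B^n-probability of {x : weight(x) >= n/2 + k} for even n."""
    if n % 2:
        raise ValueError("this sketch assumes an even n")
    return sum(math.comb(n, w) for w in range(n // 2 + k, n + 1)) / 2 ** n


def check_cube_concentration(n):
    """Compare 1 - B^n(A_r) with exp(-2 n r^2) at r = k/n for A = {weight <= n/2}."""
    rows = []
    for k in range(1, n // 2 + 1):
        r = k / n
        # d(x, A) = max(0, weight(x) - n/2) / n, so the complement of A_r
        # (the points with d(x, A) >= r) is {weight >= n/2 + k}.
        rows.append((r, tail_beyond_half(n, k), math.exp(-2 * n * r * r)))
    return rows


rows = check_cube_concentration(20)
for r, exact, bound in rows:
    print(f"r={r:.2f}  1-B(A_r)={exact:.3e}  exp(-2nr^2)={bound:.3e}")
```

Checking only the grid $r = k/n$ is enough for this particular set: for $(k-1)/n < r \le k/n$ the exact tail does not change, while the bound is smallest at $r = k/n$.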

Remark 35. Gaussian concentration of the form (3.169) is often discussed in the context of the so-called isoperimetric inequalities, which relate the full measure of a set to the measure of its boundary. To be more specific, consider a metric probability space $(\mathcal{X}, d, \mu)$, and for any Borel set $A \subseteq \mathcal{X}$ define its surface measure as (see Section 2.1])

$$\mu^+(A) \triangleq \liminf_{r \to 0} \frac{\mu(A_r \setminus A)}{r} = \liminf_{r \to 0} \frac{\mu(A_r) - \mu(A)}{r}. \tag{3.174}$$

Then, the classical Gaussian isoperimetric inequality can be stated as follows: If $H$ is a half-space in $\mathbb{R}^n$, i.e., $H = \{ x \in \mathbb{R}^n : \langle x, u \rangle < c \}$ for some $u \in \mathbb{R}^n$ with $\|u\| = 1$ and some $c \in [-\infty, +\infty]$, and if $A \subseteq \mathbb{R}^n$ is a Borel set with $G^n(A) = G^n(H)$, then

$$(G^n)^+(A) \ge (G^n)^+(H), \tag{3.175}$$

with equality if and only if $A$ is a half-space. In other words, the Gaussian isoperimetric inequality (3.175) says that, among all Borel subsets of $\mathbb{R}^n$ with a given Gaussian volume, the half-spaces have the smallest surface measure. An equivalent integrated version of (3.175) says the following (see, e.g., [146]): Consider a Borel set $A$ in $\mathbb{R}^n$ and a half-space $H = \{ x : \langle x, u \rangle < c \}$ with $\|u\| = 1$, $c \ge 0$ and $G^n(A) = G^n(H)$. Then, for any $r \ge 0$, we have

$$G^n(A_r) \ge G^n(H_r),$$

with equality if and only if $A$ is itself a half-space. Moreover, an easy calculation shows that

$$G^n(H_r) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{c + r} \exp\left(-\frac{t^2}{2}\right) dt \ge 1 - \frac{1}{2} \exp\left(-\frac{(r + c)^2}{2}\right), \qquad \forall r \ge 0.$$

So, if $G^n(A) \ge 1/2$, we can always choose $c = 0$ and get (3.171).

Intuitively, what (3.169) says is that, if $\mu$ has normal concentration on $(\mathcal{X}, d)$, then most of the probability mass in $\mathcal{X}$ is concentrated around any set with probability at least 1/2. At first glance, this seems to have nothing to do with what we have been looking at all this time, namely the concentration of Lipschitz functions on $\mathcal{X}$ around their mean. However, as we will now show, the geometric and the functional pictures of the concentration of measure phenomenon are, in fact, equivalent. To that end, let us define the median of a function $f : \mathcal{X} \to \mathbb{R}$: we say that a real number $m_f$ is a median of $f$ w.r.t. $\mu$ (or a $\mu$-median of $f$) if

$$\mathbb{P}_\mu\big(f(X) \ge m_f\big) \ge \frac{1}{2} \quad \text{and} \quad \mathbb{P}_\mu\big(f(X) \le m_f\big) \ge \frac{1}{2} \tag{3.176}$$

(note that a median of $f$ may not be unique). The precise result is as follows:
Theorem 34. Let $(\mathcal{X}, d, \mu)$ be a metric probability space. Then $\mu$ has the normal concentration property (3.169) (with arbitrary constants $K, \kappa > 0$) if and only if for every Lipschitz function $f : \mathcal{X} \to \mathbb{R}$ (where the Lipschitz property is defined w.r.t. the metric $d$) we have

$$\mathbb{P}_\mu\big(f(X) \ge m_f + r\big) \le K \exp\left(-\frac{\kappa r^2}{\|f\|^2_{\mathrm{Lip}}}\right), \qquad \forall r \ge 0, \tag{3.177}$$

where $m_f$ is any $\mu$-median of $f$.

Proof. Suppose that $\mu$ satisfies (3.169). Fix any Lipschitz function $f$ where, without loss of generality, we may assume that $\|f\|_{\mathrm{Lip}} = 1$. Let $m_f$ be any median of $f$, and define the set $A^f \triangleq \{ x \in \mathcal{X} : f(x) \le m_f \}$. By definition of the median in (3.176), $\mu(A^f) \ge 1/2$. Consequently, by (3.169), we have

$$\mu(A^f_r) = \mathbb{P}_\mu\big(d(X, A^f) < r\big) \ge 1 - K \exp(-\kappa r^2), \qquad \forall r \ge 0. \tag{3.178}$$

By the Lipschitz property of $f$, for any $y \in A^f$ we have $f(X) - m_f \le f(X) - f(y) \le d(X, y)$, so $f(X) - m_f \le d(X, A^f)$. This, together with (3.178), implies that

$$\mathbb{P}_\mu\big(f(X) - m_f < r\big) \ge \mathbb{P}_\mu\big(d(X, A^f) < r\big) \ge 1 - K \exp(-\kappa r^2), \qquad \forall r \ge 0,$$

which is (3.177).

Conversely, suppose (3.177) holds for every Lipschitz $f$. Choose any Borel set $A$ with $\mu(A) \ge 1/2$ and define the function $f_A(x) \triangleq d(x, A)$ for every $x \in \mathcal{X}$. Then $f_A$ is 1-Lipschitz, since

$$|f_A(x) - f_A(y)| = \Big| \inf_{u \in A} d(x, u) - \inf_{u \in A} d(y, u) \Big| \le \sup_{u \in A} |d(x, u) - d(y, u)| \le d(x, y),$$

where the last step is by the triangle inequality. Moreover, zero is a median of $f_A$, since

$$\mathbb{P}_\mu\big(f_A(X) \le 0\big) \ge \mathbb{P}_\mu(X \in A) \ge \frac{1}{2} \quad \text{and} \quad \mathbb{P}_\mu\big(f_A(X) \ge 0\big) \ge \frac{1}{2},$$

where the second bound is trivially true since $f_A \ge 0$ everywhere. Consequently, with $m_f = 0$, we get

$$1 - \mu(A_r) = \mathbb{P}_\mu\big(d(X, A) \ge r\big) = \mathbb{P}_\mu\big(f_A(X) \ge m_f + r\big) \le K \exp(-\kappa r^2), \qquad \forall r \ge 0,$$

which gives (3.169). □

It is shown in the following that for Lipschitz functions, normal concentration around the mean also implies normal concentration around any median, but possibly with worse constants [3, Proposition 1.7]:

Theorem 35. Let $(\mathcal{X}, d, \mu)$ be a metric probability space such that for any 1-Lipschitz function $f : \mathcal{X} \to \mathbb{R}$ we have

$$\mathbb{P}_\mu\big(f(X) \ge \mathbb{E}_\mu[f(X)] + r\big) \le K_0 \exp(-\kappa_0 r^2), \qquad \forall r \ge 0 \tag{3.179}$$

with some constants $K_0, \kappa_0 > 0$. Then, $\mu$ has the normal concentration property (3.169) with $K = K_0$ and $\kappa = \kappa_0/4$. Consequently, the concentration inequality in (3.177) around any median $m_f$ is satisfied with the same constants $\kappa$ and $K$.

Proof. Let $A \subseteq \mathcal{X}$ be an arbitrary Borel set with $\mu(A) > 0$, and fix some $r > 0$. Define the function $f_{A,r}(x) \triangleq \min\{d(x, A), r\}$. Then, from the triangle inequality, $\|f_{A,r}\|_{\mathrm{Lip}} \le 1$ and

$$\mathbb{E}_\mu[f_{A,r}(X)] = \int_{\mathcal{X}} \min\{d(x, A), r\} \, \mu(dx) = \underbrace{\int_A \min\{d(x, A), r\} \, \mu(dx)}_{=0} + \int_{A^c} \min\{d(x, A), r\} \, \mu(dx) \le r \mu(A^c) = (1 - \mu(A)) \, r. \tag{3.180}$$

Then,

$$1 - \mu(A_r) = \mathbb{P}_\mu\big(d(X, A) \ge r\big) = \mathbb{P}_\mu\big(f_{A,r}(X) \ge r\big) \le \mathbb{P}_\mu\big(f_{A,r}(X) \ge \mathbb{E}_\mu[f_{A,r}(X)] + r \mu(A)\big) \le K_0 \exp\big(-\kappa_0 (\mu(A) r)^2\big),$$

where the first two steps use the definition of $f_{A,r}$, the third step uses (3.180), and the last step uses (3.179). Consequently, if $\mu(A) \ge 1/2$, we get (3.169) with $K = K_0$ and $\kappa = \kappa_0/4$. Theorem 34 therefore implies that the concentration inequality in (3.177) holds for any median $m_f$ with the same constants $\kappa$ and $K$. □

Remark 36. Let $(\mathcal{X}, d, \mu)$ be a metric probability space, and suppose that $\mu$ has the normal concentration property (3.169) (with arbitrary constants $K, \kappa > 0$). Let $f : \mathcal{X} \to \mathbb{R}$ be an arbitrary Lipschitz function (where the Lipschitz property is defined w.r.t. the metric $d$). In the following, we provide an upper bound on the distance between the mean and any median of $f$ (w.r.t. $\mu$) in terms of the parameters $\kappa$ and $K$ of (3.169), and the Lipschitz constant of $f$. From Theorem 34, it follows that

$$\big| \mathbb{E}_\mu[f(X)] - m_f \big| \le \mathbb{E}_\mu\big[|f(X) - m_f|\big] = \int_0^\infty \mathbb{P}_\mu\big(|f(X) - m_f| \ge r\big) \, dr \le \int_0^\infty 2K \exp\left(-\frac{\kappa r^2}{\|f\|^2_{\mathrm{Lip}}}\right) dr = \sqrt{\frac{\pi}{\kappa}} \, K \|f\|_{\mathrm{Lip}}, \tag{3.181}$$

where the last inequality follows from the (one-sided) concentration inequality in (3.177) and since $f$ and $-f$ are Lipschitz functions with the same constant.



3.4.2 Marton's argument: from transportation to concentration

As shown above, the phenomenon of concentration is fundamentally geometric in nature, as captured by the isoperimetric inequality (3.169). Once we have established (3.169) on a given metric probability space $(\mathcal{X}, d, \mu)$, we immediately obtain Gaussian concentration for all Lipschitz functions $f : \mathcal{X} \to \mathbb{R}$ by Theorem 34.

There is a powerful information-theoretic technique for deriving concentration inequalities like (3.169). This technique, first introduced by Marton (see [70] and [58]), hinges on a certain type of inequality that relates the divergence between two probability measures to a quantity called the transportation cost. Let $(\mathcal{X}, d)$ be a Polish space. Given $p \ge 1$, let $\mathcal{P}_p(\mathcal{X})$ denote the space of all Borel probability measures $\mu$ on $\mathcal{X}$, such that the moment bound

$$\mathbb{E}_\mu\big[d^p(X, x_0)\big] < \infty \tag{3.182}$$

holds for some (and hence all) $x_0 \in \mathcal{X}$.

Definition 5. Given $p \ge 1$, the $L^p$ Wasserstein distance between any pair $\mu, \nu \in \mathcal{P}_p(\mathcal{X})$ is defined as

$$W_p(\mu, \nu) \triangleq \inf_{\pi \in \Pi(\mu, \nu)} \left( \int_{\mathcal{X} \times \mathcal{X}} d^p(x, y) \, \pi(dx, dy) \right)^{1/p}, \tag{3.183}$$

where $\Pi(\mu, \nu)$ is the set of all probability measures $\pi$ on the product space $\mathcal{X} \times \mathcal{X}$ with marginals $\mu$ and $\nu$.

Remark 37. Another equivalent way of writing down the definition of $W_p(\mu, \nu)$ is

$$W_p(\mu, \nu) = \inf_{X \sim \mu, \, Y \sim \nu} \big\{ \mathbb{E}[d^p(X, Y)] \big\}^{1/p}, \tag{3.184}$$

where the infimum is over all pairs $(X, Y)$ of jointly distributed random variables with values in $\mathcal{X}$, such that $P_X = \mu$ and $P_Y = \nu$.

Remark 38. The name "transportation cost" comes from the following interpretation: Let $\mu$ (resp., $\nu$) represent the initial (resp., desired) distribution of some matter (say, sand) in space, such that the total mass in both cases is normalized to one. Thus, both $\mu$ and $\nu$ correspond to sand piles of some given shapes. The objective is to rearrange the initial sand pile with shape $\mu$ into one with shape $\nu$ with minimum cost, where the cost of transporting a grain of sand from location $x$ to location $y$ is given by $c(x, y)$ for some sufficiently regular function $c : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. If we allow randomized transportation policies, i.e., those that associate with each location $x$ in the initial sand pile a conditional probability distribution $\pi(dy|x)$ for the destination in the final sand pile, then the minimum transportation cost is given by

$$C^*(\mu, \nu) \triangleq \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} c(x, y) \, \pi(dx, dy). \tag{3.185}$$

When the cost function is given by $c = d^p$ for some $p \ge 1$ and $d$ is a metric on $\mathcal{X}$, we will have $C^*(\mu, \nu) = W_p^p(\mu, \nu)$. The optimal transportation problem (3.185) has a rich history, dating back to a 1781 essay by Gaspard Monge, who considered a particular special case of the problem

$$C_0^*(\mu, \nu) \triangleq \inf_{\varphi : \mathcal{X} \to \mathcal{X}} \left\{ \int_{\mathcal{X}} c(x, \varphi(x)) \, \mu(dx) : \mu \circ \varphi^{-1} = \nu \right\}. \tag{3.186}$$

Here, the infimum is over all deterministic transportation policies, i.e., measurable mappings $\varphi : \mathcal{X} \to \mathcal{X}$, such that the desired final measure $\nu$ is the image of $\mu$ under $\varphi$, or, in other words, if $X \sim \mu$, then $Y = \varphi(X) \sim \nu$. The problem (3.186) (or the Monge optimal transportation problem, as it has now come to be called) does not always admit a solution (incidentally, an optimal mapping does exist in the case considered by Monge, namely $\mathcal{X} = \mathbb{R}^3$ and $c(x, y) = \|x - y\|$). A stochastic relaxation of Monge's problem, given by (3.185), was considered in 1942 by Leonid Kantorovich (and reprinted more recently [147]). We recommend the books by Villani [59, 60] for a detailed historical overview and rigorous treatment of optimal transportation.
Lemma 14. The Wasserstein distances have the following properties:

1. For each $p \ge 1$, $W_p(\cdot, \cdot)$ is a metric on $\mathcal{P}_p(\mathcal{X})$.

2. If $1 \le p \le q$, then $\mathcal{P}_p(\mathcal{X}) \supseteq \mathcal{P}_q(\mathcal{X})$, and $W_p(\mu, \nu) \le W_q(\mu, \nu)$ for any $\mu, \nu \in \mathcal{P}_q(\mathcal{X})$.

3. $W_p$ metrizes weak convergence plus convergence of $p$th-order moments: a sequence $\{\mu_n\}_{n=1}^\infty$ in $\mathcal{P}_p(\mathcal{X})$ converges to $\mu \in \mathcal{P}_p(\mathcal{X})$ in $W_p$, i.e., $W_p(\mu_n, \mu) \to 0$ as $n \to \infty$, if and only if:

(a) $\{\mu_n\}$ converges to $\mu$ weakly, i.e., $\mathbb{E}_{\mu_n}[\varphi] \to \mathbb{E}_\mu[\varphi]$ as $n \to \infty$ for any continuous and bounded function $\varphi : \mathcal{X} \to \mathbb{R}$.

(b) For some (and hence all) $x_0 \in \mathcal{X}$,

$$\int_{\mathcal{X}} d^p(x, x_0) \, \mu_n(dx) \xrightarrow{n \to \infty} \int_{\mathcal{X}} d^p(x, x_0) \, \mu(dx).$$

If the above two statements hold, then we say that $\{\mu_n\}$ converges to $\mu$ weakly in $\mathcal{P}_p(\mathcal{X})$.

4. The mapping $(\mu, \nu) \mapsto W_p(\mu, \nu)$ is continuous on $\mathcal{P}_p(\mathcal{X})$, i.e., if $\mu_n \to \mu$ and $\nu_n \to \nu$ weakly in $\mathcal{P}_p(\mathcal{X})$, then $W_p(\mu_n, \nu_n) \to W_p(\mu, \nu)$. However, it is only lower semicontinuous in the usual weak topology (without the convergence of $p$th-order moments): if $\mu_n \to \mu$ and $\nu_n \to \nu$ weakly, then

$$\liminf_{n \to \infty} W_p(\mu_n, \nu_n) \ge W_p(\mu, \nu).$$

5. The infimum in (3.183) [and therefore in (3.184)] is actually a minimum; in other words, there exists an optimal coupling $\pi^* \in \Pi(\mu, \nu)$, such that

$$W_p^p(\mu, \nu) = \int_{\mathcal{X} \times \mathcal{X}} d^p(x, y) \, \pi^*(dx, dy).$$

Equivalently, there exists a pair $(X^*, Y^*)$ of jointly distributed $\mathcal{X}$-valued random variables with $P_{X^*} = \mu$ and $P_{Y^*} = \nu$, such that

$$W_p^p(\mu, \nu) = \mathbb{E}\big[d^p(X^*, Y^*)\big].$$

6. If $p = 2$, $\mathcal{X} = \mathbb{R}$ with $d(x, y) = |x - y|$, and $\mu$ is atomless (i.e., if $\mu(\{x\}) = 0$ for all $x \in \mathbb{R}$), then the optimal coupling between $\mu$ and any $\nu$ is given by the deterministic mapping

$$Y = F_\nu^{-1} \circ F_\mu(X)$$

for $X \sim \mu$, where $F_\mu$ denotes the cumulative distribution function (cdf) of $\mu$, i.e., $F_\mu(x) = \mathbb{P}_\mu(X \le x)$, and $F_\nu^{-1}$ is the quantile function of $\nu$, i.e., $F_\nu^{-1}(x) = \inf\{ \alpha \in \mathbb{R} : F_\nu(\alpha) \ge x \}$.
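The monotone map $F_\nu^{-1} \circ F_\mu$ of item 6 has a simple discrete analogue that can be tested numerically. The sketch below rests on assumptions that are not in the text: $\mu$ and $\nu$ are uniform empirical measures on the same number of real points (so $\mu$ is not atomless and item 6 does not literally apply), and we rely on the standard fact that for such measures an optimal coupling can be taken to be a one-to-one matching. The function names are hypothetical. Under these assumptions, matching sorted samples plays the role of the quantile coupling, and an exhaustive search over all matchings confirms that it attains the minimum.

```python
import itertools


def wasserstein_p_sorted(xs, ys, p):
    """W_p between uniform empirical measures on xs and ys via the sorted (monotone) matching."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many atoms on both sides")
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
    return cost ** (1.0 / p)


def wasserstein_p_bruteforce(xs, ys, p):
    """Same quantity by exhaustive search over all one-to-one matchings (tiny inputs only)."""
    best = min(
        sum(abs(a - b) ** p for a, b in zip(xs, perm))
        for perm in itertools.permutations(ys)
    )
    return (best / len(xs)) ** (1.0 / p)


# Hypothetical sample points.
xs = [0.3, -1.2, 2.5, 0.9, 1.7]
ys = [1.1, 0.0, -0.4, 3.2, 2.0]
for p in (1, 2):
    print(p, wasserstein_p_sorted(xs, ys, p), wasserstein_p_bruteforce(xs, ys, p))
```

The same example also illustrates item 2 of the lemma: the value computed for $p = 1$ does not exceed the one for $p = 2$.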

Definition 6. We say that a probability measure $\mu$ on $(\mathcal{X}, d)$ satisfies an $L^p$ transportation cost inequality with constant $c > 0$, or a $T_p(c)$ inequality for short, if for any probability measure $\nu \ll \mu$ we have

$$W_p(\mu, \nu) \le \sqrt{2c D(\nu \| \mu)}. \tag{3.187}$$

Example 16 (Total variation distance and Pinsker's inequality). Here is a specific example illustrating this abstract machinery, which should be familiar territory to information theorists. Let $\mathcal{X}$ be a discrete set equipped with the Hamming metric $d(x, y) = 1_{\{x \ne y\}}$. In this case, the corresponding $L^1$ Wasserstein distance between any two probability measures $\mu$ and $\nu$ on $\mathcal{X}$ takes the simple form

$$W_1(\mu, \nu) = \inf_{X \sim \mu, \, Y \sim \nu} \mathbb{P}(X \ne Y).$$
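As we will now show, this turns out to be nothing but the usual total variation distance

$$\|\mu - \nu\|_{\mathrm{TV}} \triangleq \sup_{A \subseteq \mathcal{X}} |\mu(A) - \nu(A)| = \frac{1}{2} \sum_{x \in \mathcal{X}} |\mu(x) - \nu(x)| \tag{3.188}$$

(we are abusing the notation here, writing $\mu(x)$ for the $\mu$-probability of the singleton $\{x\}$). To see this, consider any $\pi \in \Pi(\mu, \nu)$. Then, for any $x$, we have

$$\mu(x) = \sum_{y \in \mathcal{X}} \pi(x, y) \ge \pi(x, x),$$

and the same goes for $\nu$. Consequently, $\pi(x, x) \le \min\{\mu(x), \nu(x)\}$, and so

$$\mathbb{E}_\pi[d(X, Y)] = \mathbb{E}_\pi\big[1_{\{X \ne Y\}}\big] \tag{3.189}$$
$$= \mathbb{P}(X \ne Y) \tag{3.190}$$
$$= 1 - \sum_{x \in \mathcal{X}} \pi(x, x) \tag{3.191}$$
$$\ge 1 - \sum_{x \in \mathcal{X}} \min\{\mu(x), \nu(x)\}. \tag{3.192}$$

On the other hand, if we define the set $A = \{ x \in \mathcal{X} : \mu(x) \ge \nu(x) \}$, then

$$\|\mu - \nu\|_{\mathrm{TV}} = \frac{1}{2} \sum_{x \in A} |\mu(x) - \nu(x)| + \frac{1}{2} \sum_{x \in A^c} |\mu(x) - \nu(x)| = \frac{1}{2} \sum_{x \in A} [\mu(x) - \nu(x)] + \frac{1}{2} \sum_{x \in A^c} [\nu(x) - \mu(x)] = \frac{1}{2} \big( \mu(A) - \nu(A) + \nu(A^c) - \mu(A^c) \big) = \mu(A) - \nu(A)$$

and

$$\sum_{x \in \mathcal{X}} \min\{\mu(x), \nu(x)\} = \sum_{x \in A} \nu(x) + \sum_{x \in A^c} \mu(x) = \nu(A) + \mu(A^c) = 1 - \big(\mu(A) - \nu(A)\big) = 1 - \|\mu - \nu\|_{\mathrm{TV}}. \tag{3.193}$$

Consequently, for any $\pi \in \Pi(\mu, \nu)$ we see from (3.189)-(3.193) that

$$\mathbb{P}(X \ne Y) = \mathbb{E}_\pi[d(X, Y)] \ge \|\mu - \nu\|_{\mathrm{TV}}. \tag{3.194}$$

Moreover, the lower bound in (3.194) is actually achieved by $\pi^*$ taking

$$\pi^*(x, y) = \min\{\mu(x), \nu(x)\} \, 1_{\{x = y\}} + \frac{\big(\mu(x) - \nu(x)\big) 1_{\{x \in A\}} \, \big(\nu(y) - \mu(y)\big) 1_{\{y \in A^c\}}}{\mu(A) - \nu(A)} \, 1_{\{x \ne y\}}. \tag{3.195}$$

Now that we have expressed the total variation distance $\|\mu - \nu\|_{\mathrm{TV}}$ as the $L^1$ Wasserstein distance induced by the Hamming metric on $\mathcal{X}$, the well-known Pinsker's inequality

$$\|\mu - \nu\|_{\mathrm{TV}} \le \sqrt{\frac{1}{2} D(\nu \| \mu)} \tag{3.196}$$

can be identified as a $T_1(1/4)$ inequality that holds for any probability measure $\mu$ on $\mathcal{X}$.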
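The identification of $W_1$ under the Hamming metric with the total variation distance can be checked on a small alphabet. The sketch below is illustrative only: the function names and the two distributions are hypothetical. It builds a coupling that keeps mass $\min\{\mu(x), \nu(x)\}$ on the diagonal and spreads the excess $(\mu(x) - \nu(x))^+$ over the deficits $(\nu(y) - \mu(y))^+$, normalized by the total variation distance. It then verifies that the marginals are right, that $\mathbb{P}(X \ne Y)$ equals the total variation distance, and that Pinsker's inequality (3.196) holds for this pair.

```python
import math


def total_variation(mu, nu):
    """Half the L1 distance between two pmfs given as dicts over the same keys."""
    return 0.5 * sum(abs(mu[x] - nu[x]) for x in mu)


def maximal_coupling(mu, nu):
    """Coupling with min(mu, nu) on the diagonal and the excess mass moved off it."""
    tv = total_variation(mu, nu)
    pi = {}
    for x in mu:
        for y in nu:
            if x == y:
                pi[(x, y)] = min(mu[x], nu[x])
            elif tv > 0:
                pi[(x, y)] = max(mu[x] - nu[x], 0.0) * max(nu[y] - mu[y], 0.0) / tv
            else:
                pi[(x, y)] = 0.0
    return pi


def kl_divergence(nu, mu):
    """D(nu || mu) in nats, assuming nu is absolutely continuous w.r.t. mu."""
    return sum(q * math.log(q / mu[x]) for x, q in nu.items() if q > 0)


# Hypothetical distributions on a three-letter alphabet.
mu = {"a": 0.5, "b": 0.3, "c": 0.2}
nu = {"a": 0.2, "b": 0.3, "c": 0.5}
pi = maximal_coupling(mu, nu)
prob_differ = sum(v for (x, y), v in pi.items() if x != y)
print(total_variation(mu, nu), prob_differ, math.sqrt(0.5 * kl_divergence(nu, mu)))
```

Any other coupling of the same pair can only have a larger probability of disagreement, which is the content of the lower bound (3.194).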
Remark 39. It should be pointed out that the constant $c = 1/4$ in Pinsker's inequality (3.196) is not necessarily the best possible for a given distribution $\mu$. Ordentlich and Weinberger [148] have obtained the following distribution-dependent refinement of Pinsker's inequality. Let the function $\varphi : [0, 1/2] \to \mathbb{R}^+$ be defined by

$$\varphi(p) \triangleq \begin{cases} \left(\dfrac{1}{1 - 2p}\right) \ln\left(\dfrac{1 - p}{p}\right), & \text{if } p \in \left[0, \tfrac{1}{2}\right) \\ 2, & \text{if } p = 1/2 \end{cases} \tag{3.197}$$

(in fact, $\varphi(p) \to 2$ as $p \uparrow 1/2$, $\varphi(p) \to \infty$ as $p \downarrow 0$, and $\varphi$ is a monotonic decreasing and convex function). Let $\mathcal{X}$ be a discrete set. For any $P \in \mathcal{P}(\mathcal{X})$, define the balance coefficient

$$\pi_P \triangleq \max_{A \subseteq \mathcal{X}} \min\{P(A), 1 - P(A)\} \implies \pi_P \in \left[0, \tfrac{1}{2}\right].$$

Then (see [148, Theorem 2.1] with the difference that follows from the existence of a factor $\frac{1}{2}$ on the right-hand side of (3.188)), for any $Q \in \mathcal{P}(\mathcal{X})$,

$$\|P - Q\|_{\mathrm{TV}} \le \sqrt{\frac{1}{\varphi(\pi_P)} \, D(Q \| P)}. \tag{3.198}$$

From the above properties of the function $\varphi$, it follows that the distribution-dependent refinement of Pinsker's inequality is more pronounced when the balance coefficient is small (i.e., $\pi_P \ll 1$). Moreover, this bound is optimal for a given $P$, in the sense that

$$\varphi(\pi_P) = \inf_{Q \in \mathcal{P}(\mathcal{X})} \frac{D(Q \| P)}{\|P - Q\|^2_{\mathrm{TV}}}. \tag{3.199}$$

For instance, if $\mathcal{X} = \{0, 1\}$ and $P$ is the distribution of a Bernoulli($p$) random variable, then $\pi_P = \min\{p, 1 - p\}$, and (since $\varphi(p)$ in (3.197) is symmetric around one-half)

$$\varphi(\pi_P) = \begin{cases} \left(\dfrac{1}{1 - 2p}\right) \ln\left(\dfrac{1 - p}{p}\right), & \text{if } p \ne \tfrac{1}{2} \\ 2, & \text{if } p = \tfrac{1}{2} \end{cases}$$

and for any other $Q \in \mathcal{P}(\{0, 1\})$ we have, from (3.198),

$$\|P - Q\|_{\mathrm{TV}} \le \begin{cases} \sqrt{\dfrac{1 - 2p}{\ln[(1 - p)/p]} \, D(Q \| P)}, & \text{if } p \ne \tfrac{1}{2} \\ \sqrt{\dfrac{1}{2} D(Q \| P)}, & \text{if } p = \tfrac{1}{2}. \end{cases} \tag{3.200}$$

The above inequality provides an upper bound on the total variation distance in terms of the divergence. In general, a bound in the reverse direction cannot be derived since the total variation distance can be arbitrarily close to zero, whereas the divergence is equal to infinity. However, consider an i.i.d. sample of size $n$ that is generated from a probability distribution $P$. Sanov's theorem implies that the probability that the empirical distribution of the generated sample deviates in total variation from $P$ by at least some $\varepsilon \in (0, 2]$ scales asymptotically like $\exp(-n D^*(P, \varepsilon))$ where

$$D^*(P, \varepsilon) \triangleq \inf_{Q : \, \|P - Q\|_{\mathrm{TV}} \ge \varepsilon} D(Q \| P).$$

Although a reverse form of Pinsker's inequality (or its probability-dependent refinement in [148]) cannot be derived, it was recently proved in [149] that

$$D^*(P, \varepsilon) \le \varphi(\pi_P) \varepsilon^2 + O(\varepsilon^3).$$

This inequality shows that the probability-dependent refinement of Pinsker's inequality in (3.198) is actually tight for $D^*(P, \varepsilon)$ when $\varepsilon$ is small, since both upper and lower bounds scale like $\varphi(\pi_P) \varepsilon^2$ if $\varepsilon \ll 1$.

Apart from providing a refined upper bound on the total variation distance between two discrete probability distributions, the inequality in (3.198) also makes it possible to derive a refined lower bound on the relative entropy when a lower bound on the total variation distance is available. This approach was studied in [150, Section III] in the context of the Poisson approximation, where (3.198) was combined with a new lower bound on the total variation distance (using the so-called Chen-Stein method) between the distribution of a sum of independent Bernoulli random variables and the Poisson distribution with the same mean. It is noted that for a sum of i.i.d. Bernoulli random variables, the resulting lower bound on this relative entropy (see [150, Theorem 7]) scales similarly to the upper bound on this relative entropy by Kontoyiannis et al. (see [151, Theorem 1]), where the derivation of the latter upper bound relies on the logarithmic Sobolev inequality for the Poisson distribution by Bobkov and Ledoux [H] (see Section 3.3.5 here).



Marton's procedure for deriving Gaussian concentration from a transportation cost inequality can be distilled in the following:

Proposition 9. Suppose $\mu$ satisfies a $T_1(c)$ inequality. Then, the Gaussian concentration inequality in (3.170) holds with $\kappa = 1/(2c)$, $K = 1$, and $r_0 = \sqrt{2c \ln 2}$.

Proof. Fix two Borel sets $A, B \subseteq \mathcal{X}$ with $\mu(A), \mu(B) > 0$. Define the conditional probability measures

$$\mu_A(C) \triangleq \frac{\mu(C \cap A)}{\mu(A)} \quad \text{and} \quad \mu_B(C) \triangleq \frac{\mu(C \cap B)}{\mu(B)},$$

where $C$ is an arbitrary Borel set in $\mathcal{X}$. Then $\mu_A, \mu_B \ll \mu$, and

$$W_1(\mu_A, \mu_B) \le W_1(\mu, \mu_A) + W_1(\mu, \mu_B) \tag{3.201}$$
$$\le \sqrt{2c D(\mu_A \| \mu)} + \sqrt{2c D(\mu_B \| \mu)}, \tag{3.202}$$

where (3.201) is by the triangle inequality, while (3.202) is because $\mu$ satisfies $T_1(c)$. Now, for any Borel set $C$, we have

$$\mu_A(C) = \int_C \frac{1_A(x)}{\mu(A)} \, \mu(dx),$$

so it follows that $\mu_A \ll \mu$ with $d\mu_A/d\mu = 1_A/\mu(A)$, and the same holds for $\mu_B$. Therefore,

$$D(\mu_A \| \mu) = \mathbb{E}_\mu\left[ \frac{d\mu_A}{d\mu} \ln \frac{d\mu_A}{d\mu} \right] = \ln \frac{1}{\mu(A)}, \tag{3.203}$$

and an analogous formula holds for $\mu_B$ in place of $\mu_A$. Substituting this into (3.202) gives

$$W_1(\mu_A, \mu_B) \le \sqrt{2c \ln \frac{1}{\mu(A)}} + \sqrt{2c \ln \frac{1}{\mu(B)}}. \tag{3.204}$$

We now obtain a lower bound on $W_1(\mu_A, \mu_B)$. Since $\mu_A$ (resp., $\mu_B$) is supported on $A$ (resp., $B$), any $\pi \in \Pi(\mu_A, \mu_B)$ is supported on $A \times B$. Consequently, for any such $\pi$ we have

$$\int_{\mathcal{X} \times \mathcal{X}} d(x, y) \, \pi(dx, dy) = \int_{A \times B} d(x, y) \, \pi(dx, dy) \ge \int_{A \times B} \inf_{y \in B} d(x, y) \, \pi(dx, dy) = \int_A d(x, B) \, \mu_A(dx) \ge \inf_{x \in A} d(x, B) \, \mu_A(A) = d(A, B), \tag{3.205}$$

where $\mu_A(A) = 1$, and $d(A, B) \triangleq \inf_{x \in A, y \in B} d(x, y)$ is the distance between $A$ and $B$. Since (3.205) holds for every $\pi \in \Pi(\mu_A, \mu_B)$, we can take the infimum over all such $\pi$ and get $W_1(\mu_A, \mu_B) \ge d(A, B)$. Combining this with (3.204) gives the inequality

$$d(A, B) \le \sqrt{2c \ln \frac{1}{\mu(A)}} + \sqrt{2c \ln \frac{1}{\mu(B)}}, \tag{3.206}$$

which holds for all Borel sets $A$ and $B$ that have nonzero $\mu$-probability.

Let $B = A_r^c$; then $\mu(B) = 1 - \mu(A_r)$ and $d(A, B) \ge r$. Consequently, (3.206) gives

$$r \le \sqrt{2c \ln \frac{1}{\mu(A)}} + \sqrt{2c \ln \frac{1}{1 - \mu(A_r)}}. \tag{3.207}$$

If $\mu(A) \ge 1/2$ and $r \ge \sqrt{2c \ln 2}$, then (3.207) gives

$$\mu(A_r) \ge 1 - \exp\left( -\frac{1}{2c} \left( r - \sqrt{2c \ln 2} \right)^2 \right). \tag{3.208}$$

Hence, the Gaussian concentration inequality in (3.170) indeed holds with $\kappa = 1/(2c)$ and $K = 1$ for all $r \ge r_0 = \sqrt{2c \ln 2}$. □
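The two inequalities at the heart of this proof, (3.206) and (3.208), can be evaluated on a case where every quantity has a closed form. The sketch below makes an assumption that goes beyond the text at this point: the standard Gaussian measure on the real line is taken to satisfy $T_1(1)$ (this follows from Talagrand's $T_2(1)$ inequality, a known fact that is not established in this passage). The half-lines and helper names are hypothetical choices for illustration.

```python
import math


def gaussian_cdf(x):
    """Standard normal cdf, written with erfc for accuracy in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def marton_separation_bound(mu_a, mu_b, c):
    """Right-hand side of (3.206): sqrt(2c ln(1/mu(A))) + sqrt(2c ln(1/mu(B)))."""
    return math.sqrt(2 * c * math.log(1 / mu_a)) + math.sqrt(2 * c * math.log(1 / mu_b))


def blowup_lower_bound(r, c):
    """Lower bound (3.208) on mu(A_r) when mu(A) >= 1/2; returns 0.0 (vacuous) for r < r0."""
    r0 = math.sqrt(2 * c * math.log(2))
    if r < r0:
        return 0.0
    return 1.0 - math.exp(-((r - r0) ** 2) / (2 * c))


def check_half_lines(a, b, c=1.0):
    """For A = (-inf, a] and B = [b, inf) with a < b, return (d(A, B), bound from (3.206))."""
    mu_a = gaussian_cdf(a)    # G(A)
    mu_b = gaussian_cdf(-b)   # G(B) = P(X >= b)
    return b - a, marton_separation_bound(mu_a, mu_b, c)


print(check_half_lines(-0.5, 1.5))
# For A = (-inf, 0] we have G(A) = 1/2 and A_r = (-inf, r), so G(A_r) = gaussian_cdf(r).
print(gaussian_cdf(3.0), blowup_lower_bound(3.0, 1.0))
```

As the remarks after the proof point out, the constants obtained this way are not optimal: for the Gaussian half-line the exact value of $G(A_r)$ is noticeably larger than the bound (3.208).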

Remark 40. The formula (3.203), apparently first used explicitly by Csiszár [152, Eq. (4.13)], is actually quite remarkable: it states that the probability of any event can be expressed as an exponential of a divergence.

While the method described in the proof of Proposition 9 does not produce optimal concentration estimates (which typically have to be derived on a case-by-case basis), it hints at the potential power of the transportation cost inequalities. To make full use of this power, we first establish an important fact that, for $p \in [1, 2]$, the $T_p$ inequalities tensorize (see, for example, [60, Proposition 22.5]):

Proposition 10 (Tensorization of transportation cost inequalities). For any $p \in [1, 2]$, the following statement is true: If $\mu$ satisfies $T_p(c)$ on $(\mathcal{X}, d)$, then, for any $n \in \mathbb{N}$, the product measure $\mu^{\otimes n}$ satisfies $T_p(c n^{2/p - 1})$ on $(\mathcal{X}^n, d_{p,n})$ with the metric

$$d_{p,n}(x^n, y^n) \triangleq \left( \sum_{i=1}^n d^p(x_i, y_i) \right)^{1/p}, \qquad \forall x^n, y^n \in \mathcal{X}^n. \tag{3.209}$$

Proof. Suppose $\mu$ satisfies $T_p(c)$. Fix some $n$ and an arbitrary probability measure $\nu$ on $(\mathcal{X}^n, d_{p,n})$. Let $X^n, Y^n \in \mathcal{X}^n$ be two independent random $n$-tuples, such that

$$P_{X^n} = P_{X_1} \otimes P_{X_2|X_1} \otimes \cdots \otimes P_{X_n|X^{n-1}} = \nu \tag{3.210}$$
$$P_{Y^n} = P_{Y_1} \otimes P_{Y_2} \otimes \cdots \otimes P_{Y_n} = \mu^{\otimes n}. \tag{3.211}$$

For each $i \in \{1, \ldots, n\}$, let us define the "conditional" $W_p$ distance

$$W_p\big(P_{X_i|X^{i-1}}, P_{Y_i} \big| P_{X^{i-1}}\big) \triangleq \left( \int_{\mathcal{X}^{i-1}} W_p^p\big(P_{X_i|X^{i-1} = x^{i-1}}, P_{Y_i}\big) \, P_{X^{i-1}}(dx^{i-1}) \right)^{1/p}.$$

We will now prove that

$$W_p^p(\nu, \mu^{\otimes n}) = W_p^p(P_{X^n}, P_{Y^n}) \le \sum_{i=1}^n W_p^p\big(P_{X_i|X^{i-1}}, P_{Y_i} \big| P_{X^{i-1}}\big), \tag{3.212}$$

where the $L^p$ Wasserstein distance on the left-hand side is computed w.r.t. the $d_{p,n}$ metric. By Lemma 14, there exists an optimal coupling of $P_{X_1}$ and $P_{Y_1}$, i.e., a pair $(X_1^*, Y_1^*)$ of jointly distributed $\mathcal{X}$-valued random variables, such that $P_{X_1^*} = P_{X_1}$, $P_{Y_1^*} = P_{Y_1}$, and

$$W_p^p(P_{X_1}, P_{Y_1}) = \mathbb{E}\big[d^p(X_1^*, Y_1^*)\big].$$

Now for each $i = 2, \ldots, n$ and each choice of $x^{i-1} \in \mathcal{X}^{i-1}$, again by Lemma 14, there exists an optimal coupling of $P_{X_i|X^{i-1} = x^{i-1}}$ and $P_{Y_i}$, i.e., a pair $\big(X_i^*(x^{i-1}), Y_i^*(x^{i-1})\big)$ of jointly distributed $\mathcal{X}$-valued random variables, such that $P_{X_i^*(x^{i-1})} = P_{X_i|X^{i-1} = x^{i-1}}$, $P_{Y_i^*(x^{i-1})} = P_{Y_i}$, and

$$W_p^p\big(P_{X_i|X^{i-1} = x^{i-1}}, P_{Y_i}\big) = \mathbb{E}\big[d^p\big(X_i^*(x^{i-1}), Y_i^*(x^{i-1})\big)\big]. \tag{3.213}$$

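Moreover, because $\mathcal{X}$ is a Polish space, all couplings can be constructed in such a way that the mapping $x^{i-1} \mapsto \mathbb{P}\big((X_i^*(x^{i-1}), Y_i^*(x^{i-1})) \in C\big)$ is measurable for each Borel set $C \subseteq \mathcal{X} \times \mathcal{X}$ [60]. In other words, for each $i$, we can define the regular conditional distributions

$$P_{X_i^* Y_i^* | X^{*(i-1)} = x^{i-1}} \triangleq P_{X_i^*(x^{i-1}) Y_i^*(x^{i-1})}, \qquad \forall x^{i-1} \in \mathcal{X}^{i-1}$$

such that

$$P_{X^{*n} Y^{*n}} = P_{X_1^* Y_1^*} \otimes P_{X_2^* Y_2^* | X_1^*} \otimes \cdots \otimes P_{X_n^* Y_n^* | X^{*(n-1)}}$$

is a coupling of $P_{X^n} = \nu$ and $P_{Y^n} = \mu^{\otimes n}$, and

$$W_p^p\big(P_{X_i|X^{i-1}}, P_{Y_i}\big) = \mathbb{E}\big[d^p(X_i^*, Y_i^*) \,\big|\, X^{*(i-1)}\big], \qquad i = 1, \ldots, n. \tag{3.214}$$

By definition of $W_p$, we then have

$$W_p^p(\nu, \mu^{\otimes n}) \le \mathbb{E}\big[d^p_{p,n}(X^{*n}, Y^{*n})\big] \tag{3.215}$$
$$= \sum_{i=1}^n \mathbb{E}\big[d^p(X_i^*, Y_i^*)\big] \tag{3.216}$$
$$= \sum_{i=1}^n \mathbb{E}\Big[ \mathbb{E}\big[d^p(X_i^*, Y_i^*) \,\big|\, X^{*(i-1)}\big] \Big] \tag{3.217}$$
$$= \sum_{i=1}^n W_p^p\big(P_{X_i|X^{i-1}}, P_{Y_i} \big| P_{X^{i-1}}\big), \tag{3.218}$$

where:

(3.215) is due to the fact that $(X^{*n}, Y^{*n})$ is a (not necessarily optimal) coupling of $P_{X^n} = \nu$ and $P_{Y^n} = \mu^{\otimes n}$;

(3.216) is by the definition (3.209) of $d_{p,n}$;

(3.217) is by the law of iterated expectation; and

(3.218) is by (3.214).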

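We have thus proved (3.212). By hypothesis, $\mu$ satisfies $T_p(c)$ on $(\mathcal{X}, d)$. Therefore, since $P_{Y_i} = \mu$ for every $i$, we can write

$$W_p^p\big(P_{X_i|X^{i-1}}, P_{Y_i} \big| P_{X^{i-1}}\big) = \int_{\mathcal{X}^{i-1}} W_p^p\big(P_{X_i|X^{i-1} = x^{i-1}}, P_{Y_i}\big) \, P_{X^{i-1}}(dx^{i-1})$$
$$\le \int_{\mathcal{X}^{i-1}} \Big( 2c D\big(P_{X_i|X^{i-1} = x^{i-1}} \big\| P_{Y_i}\big) \Big)^{p/2} P_{X^{i-1}}(dx^{i-1})$$
$$\le (2c)^{p/2} \Big( D\big(P_{X_i|X^{i-1}} \big\| P_{Y_i} \big| P_{X^{i-1}}\big) \Big)^{p/2}, \tag{3.219}$$

where the last step uses Jensen's inequality and the concavity of $t \mapsto t^{p/2}$ for $p \in [1, 2]$ (with equality when $p = 2$).

Consequently, it follows that

$$W_p^p(\nu, \mu^{\otimes n}) \le (2c)^{p/2} \sum_{i=1}^n \Big( D\big(P_{X_i|X^{i-1}} \big\| P_{Y_i} \big| P_{X^{i-1}}\big) \Big)^{p/2}$$
$$\le (2c)^{p/2} n^{1 - p/2} \left( \sum_{i=1}^n D\big(P_{X_i|X^{i-1}} \big\| P_{Y_i} \big| P_{X^{i-1}}\big) \right)^{p/2}$$
$$= (2c)^{p/2} n^{1 - p/2} \big( D(P_{X^n} \| P_{Y^n}) \big)^{p/2}$$
$$= (2c)^{p/2} n^{1 - p/2} D(\nu \| \mu^{\otimes n})^{p/2},$$

where the first line holds by summing from $i = 1$ to $i = n$ and using (3.212) and (3.219), the second line is by Hölder's inequality, the third line is by the chain rule for the divergence and since $P_{Y^n}$ is a product probability measure, and the fourth line is by (3.210) and (3.211). This finally gives

$$W_p(\nu, \mu^{\otimes n}) \le \sqrt{2c n^{2/p - 1} D(\nu \| \mu^{\otimes n})},$$

i.e., $\mu^{\otimes n}$ indeed satisfies the $T_p(c n^{2/p - 1})$ inequality. □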

Since $W_2$ dominates $W_1$ (cf. item 2 of Lemma 14), a $T_2(c)$ inequality is stronger than a $T_1(c)$ inequality (for an arbitrary $c > 0$). Moreover, as Proposition 10 above shows, $T_2$ inequalities tensorize exactly: if $\mu$ satisfies $T_2$ with a constant $c > 0$, then $\mu^{\otimes n}$ also satisfies $T_2$ for every $n$ with the same constant $c$. By contrast, if $\mu$ only satisfies $T_1(c)$, then the product measure $\mu^{\otimes n}$ satisfies $T_1$ with the much worse constant $cn$. As we shall shortly see, this sharp difference between the $T_1$ and $T_2$ inequalities actually has deep consequences. In a nutshell, in the two sections that follow, we will show that, for $p \in \{1,2\}$, a given probability measure $\mu$ satisfies a $T_p(c)$ inequality on $(\mathcal{X}, d)$ if and only if it has Gaussian concentration with constant $1/(2c)$. Suppose now that we wish to show Gaussian concentration for the product measure $\mu^{\otimes n}$ on the product space $(\mathcal{X}^n, d_{p,n})$. Following our tensorization programme, we could first show that $\mu$ satisfies a transportation cost inequality for some $p \in [1,2]$, then apply Proposition 10 and consequently also apply Proposition 9. If we go through with this approach, we will see that:

If $\mu$ satisfies $T_1(c)$ on $(\mathcal{X}, d)$, then $\mu^{\otimes n}$ satisfies $T_1(cn)$ on $(\mathcal{X}^n, d_{1,n})$, which is equivalent to Gaussian concentration with constant $1/(2cn)$. Hence, in this case, the concentration phenomenon is weakened by increasing the dimension $n$.

If, on the other hand, $\mu$ satisfies $T_2(c)$ on $(\mathcal{X}, d)$, then $\mu^{\otimes n}$ satisfies $T_2(c)$ on $(\mathcal{X}^n, d_{2,n})$, which is equivalent to Gaussian concentration with the same constant $1/(2c)$, and this constant is independent of the dimension $n$. Of course, these two approaches give the same constants in concentration inequalities for sums of independent random variables: if $f$ is a Lipschitz function on $(\mathcal{X}, d)$, then from the fact that
\[
d_{1,n}(x^n, y^n) = \sum_{i=1}^n d(x_i, y_i) \le \sqrt{n} \left( \sum_{i=1}^n d^2(x_i, y_i) \right)^{1/2} = \sqrt{n}\, d_{2,n}(x^n, y^n)
\]
we can conclude that, for $f_n(x^n) = \frac{1}{n} \sum_{i=1}^n f(x_i)$,
\[
\|f_n\|_{\mathrm{Lip},1} = \sup_{x^n \neq y^n} \frac{|f_n(x^n) - f_n(y^n)|}{d_{1,n}(x^n, y^n)} \le \frac{\|f\|_{\mathrm{Lip}}}{n}
\]
\[
\|f_n\|_{\mathrm{Lip},2} = \sup_{x^n \neq y^n} \frac{|f_n(x^n) - f_n(y^n)|}{d_{2,n}(x^n, y^n)} \le \frac{\|f\|_{\mathrm{Lip}}}{\sqrt{n}}.
\]
Therefore, both $T_1(c)$ and $T_2(c)$ give
\[
\mathbb{P}\left( \frac{1}{n} \sum_{i=1}^n f(X_i) \ge r \right) \le \exp\left( -\frac{n r^2}{2c \|f\|_{\mathrm{Lip}}^2} \right), \qquad \forall\, r > 0
\]
where $X_1, \ldots, X_n \in \mathcal{X}$ are i.i.d. random variables whose common marginal $\mu$ satisfies either $T_2(c)$ or $T_1(c)$, and $f$ is a Lipschitz function on $\mathcal{X}$ with $\mathbb{E}[f(X_1)] = 0$. The difference between $T_1$ and $T_2$ inequalities becomes quite pronounced in the case of "nonlinear" functions of $X_1, \ldots, X_n$. However, it is an experimental fact that $T_1$ inequalities are easier to work with than $T_2$ inequalities.
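
The agreement of the two routes for normalized sums can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original text: the function names are hypothetical, and the exponent is taken to be $r^2/(2 c_n L_n^2)$, where $c_n$ is the tensorized transportation constant and $L_n$ is the Lipschitz constant of $f_n$ in the corresponding product metric.

```python
import math


def exponent_from_t1(n, c, lip_f, r):
    """Exponent obtained via T1: constant c*n on (X^n, d_{1,n})."""
    c_n = c * n          # T1(c) tensorizes to T1(c*n)
    lip_n = lip_f / n    # Lipschitz constant of f_n w.r.t. d_{1,n}
    return r ** 2 / (2.0 * c_n * lip_n ** 2)


def exponent_from_t2(n, c, lip_f, r):
    """Exponent obtained via T2: constant c on (X^n, d_{2,n})."""
    c_n = c                         # T2(c) tensorizes with the same constant
    lip_n = lip_f / math.sqrt(n)    # Lipschitz constant of f_n w.r.t. d_{2,n}
    return r ** 2 / (2.0 * c_n * lip_n ** 2)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        e1 = exponent_from_t1(n, c=0.5, lip_f=2.0, r=0.3)
        e2 = exponent_from_t2(n, c=0.5, lip_f=2.0, r=0.3)
        # Both should equal n * r^2 / (2 * c * lip_f^2).
        print(n, e1, e2)
```

Both functions reduce to $n r^2 / (2c \|f\|_{\mathrm{Lip}}^2)$, which is why the distinction between $T_1$ and $T_2$ only shows up for functions that are not normalized sums.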

The same strategy as above can be used to prove the following generalization of Proposition 10:

Proposition 11. For any $p \in [1,2]$, the following statement is true: Let $\mu_1, \ldots, \mu_n$ be $n$ Borel probability measures on a Polish space $(\mathcal{X}, d)$, such that $\mu_i$ satisfies $T_p(c_i)$ for some $c_i > 0$, for each $i = 1, \ldots, n$. Let $c = \max_{1 \le i \le n} c_i$. Then $\mu = \mu_1 \otimes \cdots \otimes \mu_n$ satisfies $T_p(cn^{2/p-1})$ on $(\mathcal{X}^n, d_{p,n})$.



3.4.3 Gaussian concentration and $T_1$ inequalities

As we have shown above, Marton's argument can be used to deduce Gaussian concentration from a transportation cost inequality. As we will demonstrate here and in the following section, in certain cases these properties are equivalent. We will consider first the case when $\mu$ satisfies a $T_1$ inequality. The first proof of equivalence between $T_1$ and Gaussian concentration is due to Bobkov and Götze [53], and it relies on the following variational representations of the $L^1$ Wasserstein distance and the divergence:

1. Kantorovich-Rubinstein theorem [59, Theorem 1.14]: For any $\mu, \nu \in \mathcal{P}_1(\mathcal{X})$ with $(\mathcal{X}, d) = (\mathbb{R}, |\cdot|)$,
\[
W_1(\mu, \nu) = \sup_{f \colon \|f\|_{\mathrm{Lip}} \le 1} \bigl| \mathbb{E}_\mu[f] - \mathbb{E}_\nu[f] \bigr|. \tag{3.220}
\]

2. Donsker-Varadhan lemma [84, Lemma 6.2.13]: For any two Borel probability measures $\mu, \nu$ such that $\nu \ll \mu$, the following variational representation of the divergence holds:
\[
D(\nu \| \mu) = \sup_{g \in C_b(\mathbb{R})} \bigl\{ \mathbb{E}_\nu[g] - \ln \mathbb{E}_\mu[\exp(g)] \bigr\} \tag{3.221}
\]
where the supremization in (3.221) is over the set $C_b(\mathbb{R})$ of continuous and bounded real-valued functions. Furthermore, for any measurable function $g$ such that $\mathbb{E}_\mu[\exp(g)] < \infty$,
\[
\mathbb{E}_\nu[g] \le D(\nu \| \mu) + \ln \mathbb{E}_\mu[\exp(g)]. \tag{3.222}
\]
(In fact, the supremum in (3.221) can be extended to bounded Borel-measurable functions $g$ [153, Lemma 1.4.3].)

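Theorem 36 (Bobkov-Götze [53]). A Borel probability measure $\mu \in \mathcal{P}_1(\mathcal{X})$ satisfies $T_1(c)$ if and only if the inequality
\[
\mathbb{E}_\mu\bigl\{ \exp[t f(X)] \bigr\} \le \exp\left( \frac{c t^2}{2} \right) \tag{3.223}
\]
holds for all 1-Lipschitz functions $f \colon \mathcal{X} \to \mathbb{R}$ with $\mathbb{E}_\mu[f(X)] = 0$, and all $t \in \mathbb{R}$.

Remark 41. The moment condition $\mathbb{E}_\mu[d(X, x_0)] < \infty$ is needed to ensure that every Lipschitz function $f \colon \mathcal{X} \to \mathbb{R}$ is $\mu$-integrable:
\[
\mathbb{E}_\mu\bigl[|f(X)|\bigr] \le |f(x_0)| + \mathbb{E}_\mu\bigl[|f(X) - f(x_0)|\bigr] \le |f(x_0)| + \|f\|_{\mathrm{Lip}}\, \mathbb{E}_\mu[d(X, x_0)] < \infty.
\]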

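Proof. Without loss of generality, we may consider (3.223) only for $t \ge 0$.

Suppose first that $\mu$ satisfies $T_1(c)$. Consider some $\nu \ll \mu$. Using the $T_1(c)$ property of $\mu$ together with the Kantorovich-Rubinstein formula (3.220), we can write
\[
\int_{\mathcal{X}} f \, d\nu \le W_1(\nu, \mu) \le \sqrt{2c D(\nu \| \mu)}
\]
for any 1-Lipschitz $f \colon \mathcal{X} \to \mathbb{R}$ with $\mathbb{E}_\mu[f] = 0$. Next, from the fact that
\[
\inf_{t > 0} \left( \frac{a}{t} + \frac{bt}{2} \right) = \sqrt{2ab} \tag{3.224}
\]
for any $a, b \ge 0$, we see that any such $f$ must satisfy
\[
\int_{\mathcal{X}} f \, d\nu \le \frac{D(\nu \| \mu)}{t} + \frac{ct}{2}, \qquad \forall\, t > 0.
\]
Rearranging, we obtain
\[
\int_{\mathcal{X}} t f \, d\nu - \frac{ct^2}{2} \le D(\nu \| \mu), \qquad \forall\, t > 0.
\]
Applying this inequality to $\nu = \mu^{(g)}$ (the $g$-tilting of $\mu$) where $g = tf$, and using the fact that
\begin{align*}
D(\mu^{(g)} \| \mu) &= \int_{\mathcal{X}} g \, d\mu^{(g)} - \ln \int_{\mathcal{X}} \exp(g) \, d\mu \\
&= \int_{\mathcal{X}} t f \, d\nu - \ln \int_{\mathcal{X}} \exp(tf) \, d\mu
\end{align*}
we deduce that
\[
\ln \left( \int_{\mathcal{X}} \exp(tf) \, d\mu \right) \le \frac{ct^2}{2}
\]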
for all $t \ge 0$, and all $f$ with $\|f\|_{\mathrm{Lip}} \le 1$ and $\mathbb{E}_\mu[f] = 0$, which is precisely (3.223).

Conversely, assume that $\mu$ satisfies (3.223) for all 1-Lipschitz functions $f \colon \mathcal{X} \to \mathbb{R}$ with $\mathbb{E}_\mu[f(X)] = 0$ and all $t \in \mathbb{R}$, and let $\nu$ be an arbitrary Borel probability measure such that $\nu \ll \mu$. Consider any function of the form $g = tf$ where $t > 0$. By the assumption in (3.223), $\mathbb{E}_\mu[\exp(g)] < \infty$; furthermore, $g$ is a Lipschitz function so it is also measurable. Hence, (3.222) gives
\begin{align*}
D(\nu \| \mu) &\ge \int_{\mathcal{X}} t f \, d\nu - \ln \int_{\mathcal{X}} \exp(tf) \, d\mu \\
&\ge \int_{\mathcal{X}} t f \, d\nu - \int_{\mathcal{X}} t f \, d\mu - \frac{ct^2}{2}
\end{align*}
where in the second step we have used the fact that $\int_{\mathcal{X}} f \, d\mu = 0$ by hypothesis, as well as (3.223). Rearranging gives
\[
\left| \int_{\mathcal{X}} f \, d\nu - \int_{\mathcal{X}} f \, d\mu \right| \le \frac{D(\nu \| \mu)}{t} + \frac{ct}{2}, \qquad \forall\, t > 0 \tag{3.225}
\]
(the absolute value in the left-hand side is a consequence of the fact that exactly the same argument goes through with $-f$ instead of $f$). Applying (3.224), we see that the bound
\[
\left| \int_{\mathcal{X}} f \, d\nu - \int_{\mathcal{X}} f \, d\mu \right| \le \sqrt{2c D(\nu \| \mu)} \tag{3.226}
\]
holds for all 1-Lipschitz $f$ with $\mathbb{E}_\mu[f] = 0$. In fact, we may now drop the condition that $\mathbb{E}_\mu[f] = 0$ by replacing $f$ with $f - \mathbb{E}_\mu[f]$. Thus, taking the supremum over all 1-Lipschitz $f$ on the left-hand side of (3.226) and using the Kantorovich-Rubinstein formula (3.220), we conclude that $W_1(\mu, \nu) \le \sqrt{2c D(\nu \| \mu)}$ for every $\nu \ll \mu$, i.e., $\mu$ satisfies $T_1(c)$. This completes the proof of Theorem 36. □



Theorem 36 gives us an alternative way of deriving Gaussian concentration for Lipschitz functions:

Corollary 10. Let $\mathcal{A}$ be the space of all Lipschitz functions on $\mathcal{X}$, and let $\mu \in \mathcal{P}_1(\mathcal{X})$ be a Borel probability measure that satisfies $T_1(c)$. Then, the following inequality holds for every $f \in \mathcal{A}$:
\[
\mathbb{P}\bigl( f(X) \ge \mathbb{E}_\mu[f(X)] + r \bigr) \le \exp\left( -\frac{r^2}{2c \|f\|_{\mathrm{Lip}}^2} \right), \qquad \forall\, r > 0. \tag{3.227}
\]

Proof. The result follows from the Chernoff bound and (3.223). □

As another illustration, we prove the following bound:

Theorem 37. Let $\mathcal{X} = \{0,1\}^n$, equipped with the metric
\[
d(x^n, y^n) = \sum_{i=1}^n 1_{\{x_i \neq y_i\}}. \tag{3.228}
\]
Let $X_1, \ldots, X_n \in \{0,1\}$ be i.i.d. Bernoulli($p$) random variables. Then, for every Lipschitz function $f \colon \{0,1\}^n \to \mathbb{R}$,
\[
\mathbb{P}\bigl( f(X^n) - \mathbb{E}[f(X^n)] \ge r \bigr) \le \exp\left( -\frac{\ln[(1-p)/p]\, r^2}{n \|f\|_{\mathrm{Lip}}^2 (1-2p)} \right), \qquad \forall\, r > 0. \tag{3.229}
\]

Remark 42. In the limit as $p \to 1/2$, the right-hand side of (3.229) becomes $\exp\left( -\frac{2r^2}{n \|f\|_{\mathrm{Lip}}^2} \right)$.

Proof. Taking into account Remark 42, we may assume without loss of generality that $p \neq 1/2$. From the distribution-dependent refinement of Pinsker's inequality (3.200), it follows that the Bernoulli($p$) measure satisfies $T_1(1/(2\varphi(p)))$ w.r.t. the Hamming metric, where $\varphi(p)$ is defined in (3.197). By Proposition 10, the product of $n$ Bernoulli($p$) measures satisfies $T_1(n/(2\varphi(p)))$ w.r.t. the metric (3.228). The bound (3.229) then follows from Corollary 10. □
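
To get a feel for how the exponent in (3.229) depends on $p$, the following minimal sketch evaluates it numerically. It is an illustration under a stated assumption: $\varphi(p)$ is taken to be $\ln[(1-p)/p]/(1-2p)$, as read off from (3.229), and is extended by its limiting value 2 at $p = 1/2$ (cf. Remark 42). The function names are hypothetical.

```python
import math


def phi(p):
    """Assumed form of phi(p): ln((1-p)/p) / (1-2p), with limit 2 at p = 1/2."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if abs(p - 0.5) < 1e-9:
        return 2.0  # limiting value as p -> 1/2
    return math.log((1.0 - p) / p) / (1.0 - 2.0 * p)


def bernoulli_tail_bound(p, n, lip_f, r):
    """Right-hand side of (3.229): exp(-phi(p) r^2 / (n ||f||_Lip^2))."""
    return math.exp(-phi(p) * r ** 2 / (n * lip_f ** 2))


if __name__ == "__main__":
    for p in (0.05, 0.2, 0.4, 0.5):
        print(p, phi(p), bernoulli_tail_bound(p, n=100, lip_f=1.0, r=10.0))
```

Under this assumed form, $\varphi(p) = \varphi(1-p)$ and $\varphi(p) \ge 2$, so the bound is weakest at $p = 1/2$ and improves as the Bernoulli distribution becomes more skewed.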

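Remark 43. If $\|f\|_{\mathrm{Lip}} \le C/n$ for some $C > 0$, then (3.229) implies that
\[
\mathbb{P}\bigl( f(X^n) - \mathbb{E}[f(X^n)] \ge r \bigr) \le \exp\left( -\frac{\ln[(1-p)/p]}{C^2 (1-2p)} \cdot n r^2 \right), \qquad \forall\, r > 0.
\]
This will be the case, for instance, if $f(x^n) = (1/n) \sum_{i=1}^n f_i(x_i)$ for some functions $f_1, \ldots, f_n \colon \{0,1\} \to \mathbb{R}$ satisfying $|f_i(0) - f_i(1)| \le C$ for all $i = 1, \ldots, n$. More generally, any $f$ satisfying (3.138) with $c_i = c_i'/n$, $i = 1, \ldots, n$, for some constants $c_1', \ldots, c_n' \ge 0$, satisfies
\[
\mathbb{P}\bigl( f(X^n) - \mathbb{E}[f(X^n)] \ge r \bigr) \le \exp\left( -\frac{\ln[(1-p)/p]}{(1-2p) \max_{1 \le i \le n} (c_i')^2} \cdot n r^2 \right), \qquad \forall\, r > 0.
\]

3.4.4 Dimension-free Gaussian concentration and $T_2$ inequalities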

So far, we have confined our discussion to the "one-dimensional" case of a probability measure $\mu$ on a Polish space $(\mathcal{X}, d)$. Recall, however, that in most applications our interest is in functions of $n$ independent random variables taking values in $\mathcal{X}$. Proposition 10 shows that the transportation cost inequalities tensorize, so in principle this property can be used to derive concentration inequalities for such functions. However, as suggested by Proposition 10 and the discussion following it, $T_1$ inequalities are not very useful in this regard, since the resulting concentration bounds will deteriorate as $n$ increases. Therefore, we seek a direct characterization of a much stronger concentration property, the so-called dimension-free Gaussian concentration.

Once again, let $(\mathcal{X}, d, \mu)$ be a metric probability space. We say that $\mu$ has dimension-free Gaussian concentration if there exist constants $K, \kappa > 0$, such that for any $k \in \mathbb{N}$
\[
A \subseteq \mathcal{X}^k \text{ and } \mu^{\otimes k}(A) \ge 1/2 \quad \Longrightarrow \quad \mu^{\otimes k}(A_r) \ge 1 - K e^{-\kappa r^2}, \qquad \forall\, r > 0 \tag{3.230}
\]
where the isoperimetric enlargement $A_r$ of a Borel set $A \subseteq \mathcal{X}^k$ is defined w.r.t. the metric $d_k \equiv d_{2,k}$ defined according to (3.209):
\[
A_r \triangleq \left\{ y^k \in \mathcal{X}^k : \exists\, x^k \in A \text{ s.t. } \sum_{i=1}^k d^2(x_i, y_i) < r^2 \right\}.
\]

Remark 44. As before, we are mainly interested in the constant $\kappa$ in the exponent. Thus, we may explicitly say that $\mu$ has dimension-free Gaussian concentration with constant $\kappa > 0$, meaning that (3.230) holds with that $\kappa$ and some $K > 0$.

Remark 45. In the same spirit as Remark 34, it may be desirable to relax (3.230) to the following: there exists some $r_0 > 0$, such that for any $k \in \mathbb{N}$
\[
A \subseteq \mathcal{X}^k \text{ and } \mu^{\otimes k}(A) \ge 1/2 \quad \Longrightarrow \quad \mu^{\otimes k}(A_r) \ge 1 - K e^{-\kappa (r - r_0)^2}, \qquad \forall\, r \ge r_0 \tag{3.231}
\]
(see, for example, [60, Remark 22.23] or [64, Proposition 3.3]). The same considerations about (possibly) sharper constants that were stated in that remark also apply here.



In this section, we will show that dimension-free Gaussian concentration and $T_2$ are equivalent. Before we get to that, here is an example of a $T_2$ inequality:

Theorem 38 (Talagrand). Let $\mathcal{X} = \mathbb{R}^n$ and $d(x,y) = \|x - y\|$. Then $\mu = G^n$ satisfies a $T_2(1)$ inequality.

Proof. The proof starts for $n = 1$: let $\mu = G$, let $\nu \in \mathcal{P}(\mathbb{R})$ have density $f$ w.r.t. $\mu$: $f = \frac{d\nu}{d\mu}$, and let $\Phi$ denote the standard Gaussian cdf, i.e.,
\[
\Phi(x) = \int_{-\infty}^x \gamma(y) \, dy = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp\left( -\frac{y^2}{2} \right) dy, \qquad \forall\, x \in \mathbb{R}.
\]
If $X \sim G$, then (by item 6 of Lemma 14) the optimal coupling of $\mu = G$ and $\nu$, i.e., the one that achieves the infimum in
\[
W_2(\nu, \mu) = W_2(\nu, G) = \inf_{X \sim G,\, Y \sim \nu} \bigl( \mathbb{E}[(X - Y)^2] \bigr)^{1/2}
\]
is given by $Y = h(X)$ with $h = F_\nu^{-1} \circ \Phi$. Consequently,
\[
W_2^2(\nu, G) = \mathbb{E}\bigl[(X - h(X))^2\bigr] = \int_{-\infty}^{\infty} \bigl(x - h(x)\bigr)^2 \gamma(x) \, dx. \tag{3.232}
\]
Since $d\nu = f \, d\mu$ with $\mu = G$, and $F_\nu(h(x)) = \Phi(x)$ for every $x \in \mathbb{R}$, then
\[
\int_{-\infty}^x \gamma(y) \, dy = \Phi(x) = F_\nu(h(x)) = \int_{-\infty}^{h(x)} d\nu = \int_{-\infty}^{h(x)} f \, d\mu = \int_{-\infty}^{h(x)} f(y) \gamma(y) \, dy. \tag{3.233}
\]
Differentiating both sides of (3.233) w.r.t. $x$ gives
\[
h'(x) f(h(x)) \gamma(h(x)) = \gamma(x), \qquad \forall\, x \in \mathbb{R} \tag{3.234}
\]
and, since $h = F_\nu^{-1} \circ \Phi$, then $h$ is a monotonic increasing function and
\[
\lim_{x \to -\infty} h(x) = -\infty, \qquad \lim_{x \to \infty} h(x) = \infty.
\]



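Moreover,
\begin{align}
D(\nu \| G) = D(\nu \| \mu) &= \int_{\mathbb{R}} \ln \frac{d\nu}{d\mu} \, d\nu \notag \\
&= \int_{-\infty}^{\infty} \ln\bigl(f(x)\bigr) \, d\nu(x) \notag \\
&= \int_{-\infty}^{\infty} f(x) \ln\bigl(f(x)\bigr) \, d\mu(x) \notag \\
&= \int_{-\infty}^{\infty} f(x) \ln\bigl(f(x)\bigr) \gamma(x) \, dx \notag \\
&= \int_{-\infty}^{\infty} f(h(x)) \ln\bigl(f(h(x))\bigr) \gamma(h(x)) h'(x) \, dx \notag \\
&= \int_{-\infty}^{\infty} \ln\bigl(f(h(x))\bigr) \gamma(x) \, dx \tag{3.235}
\end{align}
while using above the change-of-variables formula, and also (3.234) for the last equality. From (3.234), we have
\[
\ln f(h(x)) = \ln\left( \frac{\gamma(x)}{h'(x) \gamma(h(x))} \right) = \frac{h^2(x) - x^2}{2} - \ln h'(x)
\]
so, by substituting this into (3.235), it follows that
\begin{align*}
D(\nu \| \mu) &= \frac{1}{2} \int_{-\infty}^{\infty} \bigl[h^2(x) - x^2\bigr] \gamma(x) \, dx - \int_{-\infty}^{\infty} \ln h'(x) \, \gamma(x) \, dx \\
&= \frac{1}{2} \int_{-\infty}^{\infty} \bigl(x - h(x)\bigr)^2 \gamma(x) \, dx + \int_{-\infty}^{\infty} x \bigl(h(x) - x\bigr) \gamma(x) \, dx - \int_{-\infty}^{\infty} \ln h'(x) \, \gamma(x) \, dx \\
&= \frac{1}{2} \int_{-\infty}^{\infty} \bigl(x - h(x)\bigr)^2 \gamma(x) \, dx + \int_{-\infty}^{\infty} \bigl(h'(x) - 1\bigr) \gamma(x) \, dx - \int_{-\infty}^{\infty} \ln h'(x) \, \gamma(x) \, dx \\
&\ge \frac{1}{2} \int_{-\infty}^{\infty} \bigl(x - h(x)\bigr)^2 \gamma(x) \, dx \\
&= \frac{1}{2} W_2^2(\nu, \mu)
\end{align*}
where the third line relies on integration by parts, the fourth line follows from the inequality $\ln t \le t - 1$ for $t > 0$, and the last line holds due to (3.232). This shows that $\mu = G$ satisfies $T_2(1)$, so it completes the proof of Theorem 38 for $n = 1$. Finally, this theorem is generalized for an arbitrary $n$ by tensorization via Proposition 10. □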
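
As a sanity check, the $T_2(1)$ inequality $W_2^2(\nu, G) \le 2 D(\nu \| G)$ can be evaluated in closed form when $\nu$ is itself Gaussian. The sketch below is an illustration only: it assumes $\nu = N(m, s^2)$ on the real line, for which the monotone map is $h(x) = m + sx$, and it uses the standard closed-form expressions for the divergence and the $L^2$ Wasserstein distance between one-dimensional Gaussians. The function names are hypothetical.

```python
import math


def kl_to_standard_gaussian(m, s):
    """D(N(m, s^2) || N(0, 1)) in nats, for s > 0."""
    return 0.5 * (m * m + s * s - 1.0) - math.log(s)


def w2_squared_to_standard_gaussian(m, s):
    """W_2^2(N(m, s^2), N(0, 1)): here h(x) = m + s*x, so E[(X - h(X))^2] = m^2 + (1 - s)^2."""
    return m * m + (s - 1.0) ** 2


def t2_gap(m, s):
    """2*D - W_2^2, which simplifies to 2*(s - 1 - ln s) and is therefore non-negative."""
    return 2.0 * kl_to_standard_gaussian(m, s) - w2_squared_to_standard_gaussian(m, s)


if __name__ == "__main__":
    for m in (0.0, 1.0, -3.0):
        for s in (0.1, 0.5, 1.0, 2.0, 10.0):
            print(m, s, t2_gap(m, s))
```

In this family the gap equals $2(s - 1 - \ln s)$, which is exactly the slack of the inequality $\ln t \le t - 1$ at $t = h'(x) = s$; pure translations ($s = 1$) achieve equality.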

We now get to the main result of this section, namely that dimension-free Gaussian concentration and $T_2$ are equivalent:

Theorem 39. Let $(\mathcal{X}, d, \mu)$ be a metric probability space. Then, the following statements are equivalent:

1. $\mu$ satisfies $T_2(c)$.

2. $\mu$ has dimension-free Gaussian concentration with constant $\kappa = 1/(2c)$.

Remark 46. As we will see, the implication 1) $\Rightarrow$ 2) follows easily from the tensorization property of transportation cost inequalities (Proposition 10). The reverse implication 2) $\Rightarrow$ 1) is a nontrivial result, which was proved by Gozlan [64] using an elegant probabilistic approach relying on the theory of large deviations [84].

Proof. We first prove that 1) $\Rightarrow$ 2). Assume that $\mu$ satisfies $T_2(c)$ on $(\mathcal{X}, d)$. Fix some $k \in \mathbb{N}$ and consider the metric probability space $(\mathcal{X}^k, d_{2,k}, \mu^{\otimes k})$, where the metric $d_{2,k}$ is defined by (3.209) with $p = 2$. By the tensorization property of transportation cost inequalities (Proposition 10), the product measure $\mu^{\otimes k}$ satisfies $T_2(c)$ on $(\mathcal{X}^k, d_{2,k})$. Because the $L^2$ Wasserstein distance dominates the $L^1$ Wasserstein distance (by item 2 of Lemma 14), $\mu^{\otimes k}$ also satisfies $T_1(c)$ on $(\mathcal{X}^k, d_{2,k})$. Therefore, by Proposition 9, $\mu^{\otimes k}$ has Gaussian concentration (3.170) with respect to $d_{2,k}$ with constants $\kappa = 1/(2c)$, $K = 1$, $r_0 = \sqrt{2c \ln 2}$. Since this holds for every $k \in \mathbb{N}$, we conclude that $\mu$ indeed has dimension-free Gaussian concentration with constant $\kappa = 1/(2c)$.

We now prove the converse implication 2) $\Rightarrow$ 1). Suppose that $\mu$ has dimension-free Gaussian concentration with constant $\kappa > 0$, where for simplicity we assume that $r_0 = 0$ (the argument for the general case of $r_0 > 0$ is slightly more involved, and does not contribute much in the way of insight). Let us fix some $k \in \mathbb{N}$ and consider the metric probability space $(\mathcal{X}^k, d_{2,k}, \mu^{\otimes k})$. Given $x^k \in \mathcal{X}^k$, let $P_{x^k}$ be the corresponding empirical measure, i.e.,
\[
P_{x^k} = \frac{1}{k} \sum_{i=1}^k \delta_{x_i}, \tag{3.236}
\]

where $\delta_x$ denotes a Dirac measure (unit mass) concentrated at $x \in \mathcal{X}$. Now consider a probability measure $\nu$ on $\mathcal{X}$, and define the function $f_\nu \colon \mathcal{X}^k \to \mathbb{R}$ by
\[
f_\nu(x^k) = W_2(P_{x^k}, \nu), \qquad \forall\, x^k \in \mathcal{X}^k.
\]
We claim that this function is Lipschitz w.r.t. $d_{2,k}$ with Lipschitz constant $1/\sqrt{k}$. To verify this, note that
\begin{align}
\bigl| f_\nu(x^k) - f_\nu(y^k) \bigr| &= \bigl| W_2(P_{x^k}, \nu) - W_2(P_{y^k}, \nu) \bigr| \notag \\
&\le W_2(P_{x^k}, P_{y^k}) \tag{3.237} \\
&= \inf_{\pi \in \Pi(P_{x^k}, P_{y^k})} \left( \int_{\mathcal{X} \times \mathcal{X}} d^2(x,y) \, \pi(dx, dy) \right)^{1/2} \tag{3.238} \\
&\le \left( \frac{1}{k} \sum_{i=1}^k d^2(x_i, y_i) \right)^{1/2} \tag{3.239} \\
&= \frac{1}{\sqrt{k}} \, d_{2,k}(x^k, y^k), \tag{3.240}
\end{align}
where

(3.237) is by the triangle inequality;

(3.238) is by definition of $W_2$;

(3.239) uses the fact that the measure that places mass $1/k$ on each $(x_i, y_i)$ for $i \in \{1, \ldots, k\}$ is an element of $\Pi(P_{x^k}, P_{y^k})$ (due to the definition of an empirical distribution in (3.236), the marginals of the above measure are indeed $P_{x^k}$ and $P_{y^k}$); and

(3.240) uses the definition (3.209) of $d_{2,k}$.
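
The key step in this chain, $W_2(P_{x^k}, P_{y^k}) \le k^{-1/2} d_{2,k}(x^k, y^k)$, can be illustrated numerically. The sketch below is an illustration under assumptions not made in the text: it takes $\mathcal{X} = \mathbb{R}$ with $d(x,y) = |x - y|$, and it relies on the standard fact that, for two empirical measures on the real line with the same number of atoms, an optimal coupling pairs the order statistics. The function names are hypothetical.

```python
import math
import random


def w2_empirical_1d(xs, ys):
    """W_2 between two equal-size empirical measures on R (pair the sorted samples)."""
    k = len(xs)
    total = sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys)))
    return math.sqrt(total / k)


def d2k(xs, ys):
    """Product metric d_{2,k}(x^k, y^k) = (sum_i |x_i - y_i|^2)^(1/2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)))


def lipschitz_check(k, trials=200, seed=0):
    """Check W_2(P_x, P_y) <= d_{2,k}(x, y) / sqrt(k) on random Gaussian samples."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(k)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(k)]
        if w2_empirical_1d(xs, ys) > d2k(xs, ys) / math.sqrt(k) + 1e-12:
            return False
    return True


if __name__ == "__main__":
    print(lipschitz_check(5), lipschitz_check(50, seed=1))
```

The index-wise pairing $(x_i, y_i)$ used in (3.239) is just one admissible coupling, so it can only overestimate the optimal cost; the permuted pair `[0.0, 1.0]` and `[1.0, 0.0]` shows that the inequality can be strict.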

Now let us consider the function $f_k(x^k) = W_2(P_{x^k}, \mu)$, for which, as we have just seen, we have $\|f_k\|_{\mathrm{Lip},2} \le 1/\sqrt{k}$. Let $X_1, \ldots, X_k$ be i.i.d. draws from $\mu$. Let $m_k$ denote some $\mu^{\otimes k}$-median of $f_k$. Then, by the assumed dimension-free Gaussian concentration property of $\mu$, we have from Theorem 34
\begin{align}
\mathbb{P}\bigl( f_k(X^k) \ge m_k + r \bigr) &\le \exp\left( -\frac{\kappa r^2}{\|f_k\|_{\mathrm{Lip},2}^2} \right) \notag \\
&\le \exp\bigl( -\kappa k r^2 \bigr), \qquad \forall\, r \ge 0;\ k \in \mathbb{N} \tag{3.241}
\end{align}
where the second inequality follows from the fact that $\|f_k\|_{\mathrm{Lip},2}^2 \le 1/k$.

We now claim that any sequence $\{m_k\}_{k=1}^{\infty}$ of medians of the $f_k$'s converges to zero. If $X_1, X_2, \ldots$ are i.i.d. draws from $\mu$, then the sequence of empirical distributions $\{P_{X^k}\}_{k=1}^{\infty}$ almost surely converges weakly to $\mu$ (this is known as Varadarajan's theorem [155, Theorem 11.4.1]). Therefore, since $W_2$ metrizes the topology of weak convergence together with the convergence of second moments (cf. Lemma 14), $\lim_{k \to \infty} W_2(P_{X^k}, \mu) = 0$ almost surely. Hence, since convergence almost surely implies convergence in probability,
\[
\lim_{k \to \infty} \mathbb{P}\bigl( W_2(P_{X^k}, \mu) \ge t \bigr) = 0, \qquad \forall\, t > 0.
\]
Consequently, any sequence $\{m_k\}$ of medians of the $f_k$'s converges to zero, as claimed. Combined with (3.241), this implies that
\[
\limsup_{k \to \infty} \frac{1}{k} \ln \mathbb{P}\bigl( W_2(P_{X^k}, \mu) \ge r \bigr) \le -\kappa r^2. \tag{3.242}
\]
On the other hand, for a fixed $\mu$, the mapping $\nu \mapsto W_2(\nu, \mu)$ is lower semicontinuous in the topology of weak convergence of probability measures (cf. Lemma 14). Consequently, the set $\{\nu : W_2(\nu, \mu) > r\}$ is open in the weak topology, so by Sanov's theorem [84, Theorem 6.2.10]
\[
\liminf_{k \to \infty} \frac{1}{k} \ln \mathbb{P}\bigl( W_2(P_{X^k}, \mu) \ge r \bigr) \ge -\inf \bigl\{ D(\nu \| \mu) : W_2(\mu, \nu) > r \bigr\}. \tag{3.243}
\]
Combining (3.242) and (3.243), we get that
\[
\inf \bigl\{ D(\nu \| \mu) : W_2(\mu, \nu) > r \bigr\} \ge \kappa r^2
\]
which then implies that $D(\nu \| \mu) \ge \kappa W_2^2(\mu, \nu)$. Upon rearranging, we obtain $W_2(\mu, \nu) \le \sqrt{(1/\kappa) D(\nu \| \mu)}$, which is a $T_2(c)$ inequality with $c = \frac{1}{2\kappa}$. This completes the proof of Theorem 39. □

3.4.5 A grand unification: the HWI inequality 

At this point, we have seen two perspectives on the concentration of measure phenomenon: functional (through various log-Sobolev inequalities) and probabilistic (through transportation cost inequalities). We now show that these two perspectives are, in a very deep sense, equivalent, at least in the Euclidean setting of $\mathbb{R}^n$. This equivalence is captured by a striking inequality, due to Otto and Villani [156], which relates three measures of similarity between probability measures: the divergence, $L^2$ Wasserstein distance, and Fisher information distance. In the literature on optimal transport, the divergence between two probability measures $Q$ and $P$ is often denoted by $H(Q \| P)$ or $H(Q, P)$, due to its close links to the Boltzmann $H$-functional of statistical physics. For this reason, the inequality we have alluded to above has been dubbed the HWI inequality, where H stands for the divergence, W for the Wasserstein distance, and I for the Fisher information distance.

As a warm-up, we first state a weaker version of the HWI inequality specialized to the Gaussian distribution, and give a self-contained information-theoretic proof following [157]:

Theorem 40. Let $G$ be the standard Gaussian probability distribution on $\mathbb{R}$. Then, the inequality
\[
D(P \| G) \le W_2(P, G) \sqrt{I(P \| G)}, \tag{3.244}
\]
where $W_2$ is the $L^2$ Wasserstein distance w.r.t. the absolute-value metric $d(x,y) = |x - y|$, holds for any Borel probability distribution $P$ on $\mathbb{R}$, for which the right-hand side of (3.244) is finite.

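Proof. We first show the following:

Lemma 15. Let $X$ and $Y$ be a pair of real-valued random variables, and let $N \sim G$ be independent of $(X, Y)$. Then, for any $t > 0$,
\[
D(P_{X + \sqrt{t} N} \| P_{Y + \sqrt{t} N}) \le \frac{1}{2t} W_2^2(P_X, P_Y). \tag{3.245}
\]

Proof. Using the chain rule for divergence, the divergence $D(P_{X, Y, X + \sqrt{t} N} \| P_{X, Y, Y + \sqrt{t} N})$ is expanded in two ways:
\begin{align}
D(P_{X, Y, X + \sqrt{t} N} \| P_{X, Y, Y + \sqrt{t} N}) &= D(P_{X + \sqrt{t} N} \| P_{Y + \sqrt{t} N}) + D(P_{X, Y | X + \sqrt{t} N} \| P_{X, Y | Y + \sqrt{t} N} \mid P_{X + \sqrt{t} N}) \notag \\
&\ge D(P_{X + \sqrt{t} N} \| P_{Y + \sqrt{t} N}) \tag{3.246}
\end{align}
and
\begin{align}
D(P_{X, Y, X + \sqrt{t} N} \| P_{X, Y, Y + \sqrt{t} N}) &= D(P_{X + \sqrt{t} N | X, Y} \| P_{Y + \sqrt{t} N | X, Y} \mid P_{X, Y}) \notag \\
&\overset{(a)}{=} \mathbb{E}\bigl[ D(\mathcal{N}(X, t) \| \mathcal{N}(Y, t)) \,\big|\, X, Y \bigr] \notag \\
&\overset{(b)}{=} \frac{1}{2t} \mathbb{E}\bigl[(X - Y)^2\bigr]. \tag{3.247}
\end{align}
Note that equality (a) holds since $N$ is independent of $(X, Y)$, and equality (b) is a special case of the equality
\[
D\bigl( \mathcal{N}(m_1, \sigma_1^2) \,\|\, \mathcal{N}(m_2, \sigma_2^2) \bigr) = \frac{1}{2} \ln\left( \frac{\sigma_2^2}{\sigma_1^2} \right) + \frac{1}{2} \left( \frac{(m_1 - m_2)^2}{\sigma_2^2} + \frac{\sigma_1^2}{\sigma_2^2} - 1 \right).
\]
It therefore follows from (3.246) and (3.247) that
\[
D(P_{X + \sqrt{t} N} \| P_{Y + \sqrt{t} N}) \le \frac{1}{2t} \mathbb{E}\bigl[(X - Y)^2\bigr] \tag{3.248}
\]
where the left-hand side of this inequality only depends on the marginal distributions of $X$ and $Y$ (due to the independence of $(X, Y)$ and $N \sim G$). Hence, taking the infimum of the right-hand side of (3.248) w.r.t. all couplings of $P_X$ and $P_Y$ (i.e., all $\mu \in \Pi(P_X, P_Y)$), we get (3.245) (see (3.184)). □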
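
Lemma 15 can be checked in closed form when both $X$ and $Y$ are Gaussian. The sketch below is an illustration under that assumption: for $X \sim N(m_1, v_1)$ and $Y \sim N(m_2, v_2)$, adding $\sqrt{t} N$ simply increases both variances by $t$, the divergence is given by the Gaussian formula quoted in the proof, and $W_2^2(P_X, P_Y) = (m_1 - m_2)^2 + (\sqrt{v_1} - \sqrt{v_2})^2$ is the standard one-dimensional expression. The function names are hypothetical.

```python
import math


def kl_gauss(m1, v1, m2, v2):
    """D(N(m1, v1) || N(m2, v2)) in nats; v1, v2 > 0 are variances."""
    return 0.5 * math.log(v2 / v1) + 0.5 * ((m1 - m2) ** 2 / v2 + v1 / v2 - 1.0)


def w2_squared_gauss(m1, v1, m2, v2):
    """W_2^2 between N(m1, v1) and N(m2, v2) on the real line; v1, v2 >= 0."""
    return (m1 - m2) ** 2 + (math.sqrt(v1) - math.sqrt(v2)) ** 2


def lemma15_sides(m1, v1, m2, v2, t):
    """Return (LHS, RHS) of (3.245) for Gaussian X and Y and noise level t > 0."""
    lhs = kl_gauss(m1, v1 + t, m2, v2 + t)   # smoothing adds t to each variance
    rhs = w2_squared_gauss(m1, v1, m2, v2) / (2.0 * t)
    return lhs, rhs


if __name__ == "__main__":
    for t in (0.01, 1.0, 10.0):
        print(t, lemma15_sides(1.0, 4.0, 0.0, 1.0, t))
```

A variance of zero for $X$ is allowed in `lemma15_sides` (a point mass), since the smoothed distribution still has variance $t > 0$.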

We proceed in the following with the proof of Theorem 40. Let $X$ have distribution $P$, and $Y$ have distribution $G$. Without loss of generality, we may assume that $X$ has zero mean and unit variance. We define the function
\[
F(t) \triangleq D(P_{X + \sqrt{t} Z} \| P_{Y + \sqrt{t} Z}), \qquad \forall\, t > 0
\]
where $Z \sim G$ is independent of $(X, Y)$. Then $F(0) = D(P \| G)$, and from (3.245) we have
\[
F(t) \le \frac{1}{2t} W_2^2(P_X, P_Y) = \frac{1}{2t} W_2^2(P, G). \tag{3.249}
\]
Moreover, the function $F(t)$ is differentiable, and it follows from [131, Eq. (32)] that
\[
F'(t) = \frac{1}{2t^2} \bigl[ \mathrm{mmse}(X, t^{-1}) - \mathrm{lmmse}(X, t^{-1}) \bigr] \tag{3.250}
\]
where $\mathrm{mmse}(X, \cdot)$ and $\mathrm{lmmse}(X, \cdot)$ have been defined in (3.57) and (3.59), respectively. For any $t > 0$,
\begin{align}
D(P \| G) &= F(0) \notag \\
&= -\bigl(F(t) - F(0)\bigr) + F(t) \notag \\
&= -\int_0^t F'(s) \, ds + F(t) \notag \\
&= \frac{1}{2} \int_0^t \frac{1}{s^2} \bigl( \mathrm{lmmse}(X, s^{-1}) - \mathrm{mmse}(X, s^{-1}) \bigr) ds + F(t) \tag{3.251} \\
&\le \frac{1}{2} \int_0^t \left( \frac{1}{s(s+1)} - \frac{1}{s(sJ(X) + 1)} \right) ds + \frac{1}{2t} W_2^2(P, G) \tag{3.252} \\
&= \frac{1}{2} \left( \ln \frac{tJ(X) + 1}{t + 1} + \frac{W_2^2(P, G)}{t} \right) \tag{3.253} \\
&\le \frac{1}{2} \left( \frac{t(J(X) - 1)}{t + 1} + \frac{W_2^2(P, G)}{t} \right) \tag{3.254} \\
&\le \frac{1}{2} \left( I(P \| G)\, t + \frac{W_2^2(P, G)}{t} \right) \tag{3.255}
\end{align}
where

(3.251) uses (3.250);

(3.252) uses (3.60), the Van Trees inequality (3.61), and (3.249);

(3.253) is an exercise in calculus;

(3.254) uses the inequality $\ln x \le x - 1$ for $x > 0$; and

(3.255) uses the formula (3.54) (so $I(P \| G) = J(X) - 1$ since $X \sim P$ has zero mean and unit variance, and one needs to substitute $s = 1$ in (3.54) to get $G_s = G$), and the fact that $t > 0$.

Optimizing the choice of $t$ in (3.255), we get (3.244). □

Remark 47. Note that the HWI inequality (3.244) together with the $T_2$ inequality for the Gaussian distribution imply a weaker version of the log-Sobolev inequality (3.41) (i.e., with a larger constant). Indeed, using the $T_2$ inequality of Theorem 38 on the right-hand side of (3.244), we get
\begin{align*}
D(P \| G) &\le W_2(P, G) \sqrt{I(P \| G)} \\
&\le \sqrt{2 D(P \| G)} \sqrt{I(P \| G)},
\end{align*}
which gives $D(P \| G) \le 2 I(P \| G)$. It is not surprising that we end up with a suboptimal constant here as compared to (3.41): the series of bounds leading up to (3.255) contributes a lot more slack than the single use of the van Trees inequality (3.61) in our proof of Stam's inequality (which, due to Proposition 7, is equivalent to the Gaussian log-Sobolev inequality of Gross).
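
The inequality (3.244) and the bound $D(P \| G) \le 2 I(P \| G)$ of Remark 47 can both be evaluated in closed form for Gaussian $P$. The sketch below is an illustration under stated assumptions: $P = N(m, s^2)$, and the Fisher information distance is taken to be $I(P \| G) = \mathbb{E}_P\bigl[(\tfrac{d}{dx} \ln \tfrac{dP}{dG}(X))^2\bigr]$, which for this family equals $m^2 + (s^2 - 1)^2 / s^2$. The function names are hypothetical.

```python
import math


def kl_to_g(m, s):
    """D(N(m, s^2) || N(0, 1)) in nats, for s > 0."""
    return 0.5 * (m * m + s * s - 1.0) - math.log(s)


def w2_to_g(m, s):
    """W_2(N(m, s^2), N(0, 1)) on the real line."""
    return math.sqrt(m * m + (s - 1.0) ** 2)


def fisher_distance_to_g(m, s):
    """Assumed I(P||G) for P = N(m, s^2): E[(X - (X - m)/s^2)^2]."""
    return m * m + (s * s - 1.0) ** 2 / (s * s)


def bound_3_255(t, fisher, w2_sq):
    """Right-hand side of (3.255) as a function of t > 0."""
    return 0.5 * (fisher * t + w2_sq / t)


if __name__ == "__main__":
    for m, s in ((1.0, 1.0), (0.0, 2.0), (1.0, 0.5)):
        d = kl_to_g(m, s)
        hwi = w2_to_g(m, s) * math.sqrt(fisher_distance_to_g(m, s))
        print(m, s, d, hwi)
```

Minimizing `bound_3_255` over $t$ (the minimizer is $t = W_2 / \sqrt{I}$ when both are positive) returns $W_2 \sqrt{I}$, which is the final optimization step in the proof of Theorem 40.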

We are now ready to state the HWI inequality in its strong form:

Theorem 41 (Otto-Villani [156]). Let $P$ be a Borel probability measure on $\mathbb{R}^n$ that is absolutely continuous w.r.t. the Lebesgue measure, and let the corresponding pdf $p$ be such that
\[
\nabla^2 \ln\left( \frac{1}{p} \right) \succeq K I_n \tag{3.256}
\]
for some $K \in \mathbb{R}$ (where $\nabla^2$ denotes the Hessian, and the matrix inequality $A \succeq B$ means that $A - B$ is non-negative semidefinite). Then, any probability measure $Q \ll P$ satisfies
\[
D(Q \| P) \le W_2(Q, P) \sqrt{I(Q \| P)} - \frac{K}{2} W_2^2(Q, P). \tag{3.257}
\]

We omit the proof, which relies on deep structural properties of optimal transportation mappings achieving the infimum in the definition of the $L^2$ Wasserstein metric w.r.t. the Euclidean norm in $\mathbb{R}^n$. (An alternative simpler proof was given later by Cordero-Erausquin [158].) We can, however, highlight a couple of key consequences (see [156]):

1. Suppose that $P$, in addition to satisfying the conditions of Theorem 41, also satisfies a $T_2(c)$ inequality. Using this fact in (3.257), we get
\[
D(Q \| P) \le \sqrt{2c D(Q \| P)} \sqrt{I(Q \| P)} - \frac{K}{2} W_2^2(Q, P). \tag{3.258}
\]
If the pdf $p$ of $P$ is log-concave, so that (3.256) holds with $K = 0$, then (3.258) implies the inequality
\[
D(Q \| P) \le 2c\, I(Q \| P) \tag{3.259}
\]
for any $Q \ll P$. This is a Euclidean log-Sobolev inequality that is similar to the one satisfied by $P = G^n$ (see Remark 47). However, note that the constant in front of the Fisher information distance on the right-hand side of (3.259) is suboptimal, as can be verified by letting $P = G^n$, which satisfies $T_2(1)$; going through the above steps, as we know from Section 3.2 (in particular, see (3.41)), the optimal constant should be $\frac{1}{2}$, so the one in (3.259) is off by a factor of 4. On the other hand, it is quite remarkable that, up to constants, the Euclidean log-Sobolev and $T_2$ inequalities are equivalent.

2. If the pdf $p$ of $P$ is strongly log-concave, i.e., if (3.256) holds with some $K > 0$, then $P$ satisfies the Euclidean log-Sobolev inequality with constant $\frac{1}{K}$. Indeed, using Young's inequality $ab \le \frac{a^2 + b^2}{2}$, we have from (3.257)
\begin{align*}
D(Q \| P) &\le \sqrt{K}\, W_2(Q, P) \sqrt{\frac{I(Q \| P)}{K}} - \frac{K}{2} W_2^2(Q, P) \\
&\le \frac{1}{2K} I(Q \| P),
\end{align*}
which shows that $P$ satisfies the Euclidean $\mathrm{LSI}(\frac{1}{K})$ inequality. In particular, the standard Gaussian distribution $P = G^n$ satisfies (3.256) with $K = 1$, so we even get the right constant. In fact, the statement that (3.256) with $K > 0$ implies Euclidean $\mathrm{LSI}(\frac{1}{K})$ was first proved in 1985 by Bakry and Emery [159] using very different means.



3.5 Extension to non-product distributions 

Our focus in this chapter has been mostly on functions of independent random variables. However, there 
is extensive literature on the concentration of measure for weakly dependent random variables. In this 
section, we describe (without proof) a few results along this direction that explicitly use information- 
theoretic methods. The examples we give are by no means exhaustive, and are only intended to show 
that, even in the case of dependent random variables, the underlying ideas are essentially the same as in 
the independent case. 

The basic scenario is exactly as before: We have $n$ random variables $X_1, \ldots, X_n$ with a given joint distribution $P$ (which is now not necessarily of a product form, i.e., $P = P_{X^n}$ may not be equal to $P_{X_1} \otimes \cdots \otimes P_{X_n}$), and we are interested in the concentration properties of some function $f(X^n)$.



3.5.1 Samson's transportation cost inequalities for dependent random variables 

Samson [160] has developed a general approach for deriving transportation cost inequalities for dependent random variables that revolves around a certain $L^2$ measure of dependence. Given the distribution $P = P_{X^n}$ of $(X_1, \ldots, X_n)$, consider an upper triangular matrix $A \in \mathbb{R}^{n \times n}$, such that $A_{i,j} = 0$ for $i > j$, $A_{i,i} = 1$ for all $i$, and for $i < j$
\[
A_{i,j} = \sup_{x_i, x_i'} \, \sup_{x^{i-1}} \, \bigl\| P_{X_j^n | X_i = x_i, X^{i-1} = x^{i-1}} - P_{X_j^n | X_i = x_i', X^{i-1} = x^{i-1}} \bigr\|_{\mathrm{TV}}^{1/2}. \tag{3.260}
\]
Note that in the special case where $P$ is a product measure, the matrix $A$ is equal to the $n \times n$ identity matrix. Let $\|A\|$ denote the operator norm of $A$ in the Euclidean topology, i.e.,
\[
\|A\| = \sup_{v \in \mathbb{R}^n \colon v \neq 0} \frac{\|Av\|}{\|v\|} = \sup_{v \in \mathbb{R}^n \colon \|v\| = 1} \|Av\|.
\]
Following Marton [161], Samson considers a Wasserstein-type distance on the space of probability measures on $\mathcal{X}^n$, defined by
\[
d_2(P, Q) \triangleq \inf_{\pi \in \Pi(P, Q)} \sup_{\alpha} \int \sum_{i=1}^n \alpha_i(y^n) 1_{\{x_i \neq y_i\}} \, \pi(dx^n, dy^n),
\]
where the supremum is over all vector-valued positive functions $\alpha = (\alpha_1, \ldots, \alpha_n) \colon \mathcal{X}^n \to \mathbb{R}^n$, such that
\[
\mathbb{E}_Q\bigl[ \|\alpha(Y^n)\|^2 \bigr] \le 1.
\]
The main result of [160] goes as follows:

Theorem 42. The probability distribution $P$ of $X^n$ satisfies the following transportation cost inequality:
\[
d_2(Q, P) \le \|A\| \sqrt{2 D(Q \| P)} \tag{3.261}
\]
for all $Q \ll P$.



Let us examine some implications:

1. Let $\mathcal{X} = [0,1]$. Then Theorem 42 implies that any probability measure $P$ on the unit cube $\mathcal{X}^n = [0,1]^n$ satisfies the following Euclidean log-Sobolev inequality: for any smooth convex function $f \colon [0,1]^n \to \mathbb{R}$,
\[
D(P^{(f)} \| P) \le 2 \|A\|^2 \, \mathbb{E}^{(f)}\bigl[ \|\nabla f(X^n)\|^2 \bigr] \tag{3.262}
\]
(see [160, Corollary 1]). The same method as the one we used to prove Proposition 8 and Theorem 22 can be applied to obtain from (3.262) the following concentration inequality for any convex function $f \colon [0,1]^n \to \mathbb{R}$ with $\|f\|_{\mathrm{Lip}} \le 1$:
\[
\mathbb{P}\bigl( f(X^n) \ge \mathbb{E} f(X^n) + r \bigr) \le \exp\left( -\frac{r^2}{2 \|A\|^2} \right), \qquad \forall\, r \ge 0. \tag{3.263}
\]



2. While (3.261) and its corollaries, (3.262) and (3.263), hold in full generality, these bounds are nontrivial only if the operator norm $\|A\|$ is independent of $n$. This is the case whenever the dependence between the $X_i$'s is sufficiently weak. For instance, if $X_1, \ldots, X_n$ are independent, then $A = I_{n \times n}$. In this case, (3.261) becomes
\[
d_2(Q, P) \le \sqrt{2 D(Q \| P)},
\]
and we recover the usual concentration inequalities for Lipschitz functions. To see some examples with dependent random variables, suppose that $X_1, \ldots, X_n$ is a Markov chain, i.e., for each $i$, $X_{i+1}^n$ is conditionally independent of $X^{i-1}$ given $X_i$. In that case, from (3.260), the upper triangular part of $A$ is given by
\[
A_{i,j} = \sup_{x_i, x_i'} \bigl\| P_{X_j | X_i = x_i} - P_{X_j | X_i = x_i'} \bigr\|_{\mathrm{TV}}^{1/2}, \qquad i < j
\]
and $\|A\|$ will be independent of $n$ under suitable ergodicity assumptions on the Markov chain $X_1, \ldots, X_n$. For instance, suppose that the Markov chain is homogeneous, i.e., the conditional probability distribution $P_{X_i | X_{i-1}}$ ($i > 1$) is independent of $i$, and that
\[
\sup_{x_i, x_i'} \bigl\| P_{X_{i+1} | X_i = x_i} - P_{X_{i+1} | X_i = x_i'} \bigr\|_{\mathrm{TV}} \le 2\rho
\]
for some $\rho < 1$. Then it can be shown (see [160, Eq. (2.5)]) that
\[
\|A\| \le \sqrt{2} \left( 1 + \sum_{k=1}^{n-1} \rho^{k/2} \right) \le \frac{\sqrt{2}}{1 - \sqrt{\rho}}.
\]



More generally, following Marton [161], we will say that the (not necessarily homogeneous) Markov chain $X_1, \ldots, X_n$ is contracting if, for every $i$,
\[
\delta_i \triangleq \sup_{x_i, x_i'} \bigl\| P_{X_{i+1} | X_i = x_i} - P_{X_{i+1} | X_i = x_i'} \bigr\|_{\mathrm{TV}} < 1.
\]
In this case, it can be shown that
\[
\|A\| \le \frac{1}{1 - \delta^{1/2}}, \qquad \text{where } \delta \triangleq \max_{i = 1, \ldots, n} \delta_i.
\]

3.5.2 Marton's transportation cost inequalities for $L^2$ Wasserstein distance

Another approach to obtaining concentration results for dependent random variables, due to Marton [162, 163], relies on another measure of dependence that pertains to the sensitivity of the conditional distributions of $X_i$ given $\bar{X}^i$ to the particular realization $\bar{x}^i$ of $\bar{X}^i$. The results of [162, 163] are set in the Euclidean space $\mathbb{R}^n$, and center around a transportation cost inequality for the $L^2$ Wasserstein distance
\[
W_2(P, Q) \triangleq \inf_{X^n \sim P,\, Y^n \sim Q} \sqrt{\mathbb{E} \|X^n - Y^n\|^2}, \tag{3.264}
\]
where $\| \cdot \|$ denotes the usual Euclidean norm.

We will state a particular special case of Marton's results (a more general development considers conditional distributions of $(X_i : i \in S)$ given $(X_j : j \in S^c)$ for a suitable system of sets $S \subset \{1, \ldots, n\}$). Let $P$ be a probability measure on $\mathbb{R}^n$ which is absolutely continuous w.r.t. the Lebesgue measure. For each $x^n \in \mathbb{R}^n$ and each $i \in \{1, \ldots, n\}$ we denote by $\bar{x}^i$ the vector in $\mathbb{R}^{n-1}$ obtained by deleting the $i$th coordinate of $x^n$:

$$\bar{x}^i = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n).$$

Following Marton [162], we say that $P$ is $\delta$-contractive, with $0 < \delta < 1$, if for any $y^n, z^n \in \mathbb{R}^n$

$$\sum_{i=1}^n W_2^2\left(P_{X_i|\bar{X}^i=\bar{y}^i},\, P_{X_i|\bar{X}^i=\bar{z}^i}\right) \le (1-\delta)\,\|y^n - z^n\|^2. \qquad (3.265)$$

Remark 48. Marton's contractivity condition (3.265) is closely related to the so-called Dobrushin-Shlosman mixing condition from mathematical statistical physics.

Theorem 43 (Marton [162, 163]). Suppose that $P$ is absolutely continuous w.r.t. the Lebesgue measure on $\mathbb{R}^n$ and $\delta$-contractive, and that the conditional distributions $P_{X_i|\bar{X}^i}$, $i \in \{1, \ldots, n\}$, have the following properties:

1. for each $i$, the function $x^n \mapsto p_{X_i|\bar{X}^i}(x_i|\bar{x}^i)$ is continuous, where $p_{X_i|\bar{X}^i}(\cdot|\bar{x}^i)$ denotes the univariate probability density function of $P_{X_i|\bar{X}^i=\bar{x}^i}$;

2. for each $i$ and each $\bar{x}^i \in \mathbb{R}^{n-1}$, $P_{X_i|\bar{X}^i=\bar{x}^i}$ satisfies $T_2(c)$ w.r.t. the $L^2$ Wasserstein distance (3.264) (cf. Definition 6).

Then for any probability measure $Q$ on $\mathbb{R}^n$ we have

$$W_2(Q,P) \le \left(\frac{K}{\sqrt{\delta}} + 1\right)\sqrt{2c\,D(Q\|P)}, \qquad (3.266)$$

where $K > 0$ is an absolute constant. In other words, any $P$ satisfying the conditions of the theorem admits a $T_2(c')$ inequality with $c' = (K/\sqrt{\delta} + 1)^2 c$.

The contractivity criterion (3.265) is not easy to verify in general. Let us mention one sufficient condition [162]. Let $p$ denote the probability density of $P$, and suppose that it takes the form

$$p(x^n) = \frac{1}{Z}\exp\left(-\Psi(x^n)\right) \qquad (3.267)$$

for some $C^2$ function $\Psi : \mathbb{R}^n \to \mathbb{R}$, where $Z$ is the normalization factor. For any $x^n, y^n \in \mathbb{R}^n$, let us define a matrix $B(x^n,y^n) \in \mathbb{R}^{n \times n}$ by

$$B_{ij}(x^n,y^n) \triangleq \begin{cases} \nabla^2_{ij}\Psi(x_i \odot \bar{y}^i), & i \neq j \\ 0, & i = j \end{cases} \qquad (3.268)$$

where $\nabla^2_{ij}F$ denotes the $(i,j)$ entry of the Hessian matrix of $F \in C^2(\mathbb{R}^n)$, and $x_i \odot \bar{y}^i$ denotes the $n$-tuple obtained by replacing the deleted $i$th coordinate in $\bar{y}^i$ with $x_i$:

$$x_i \odot \bar{y}^i = (y_1, \ldots, y_{i-1}, x_i, y_{i+1}, \ldots, y_n).$$

For example, if $\Psi$ is a sum of one-variable and two-variable terms

$$\Psi(x^n) = \sum_{i=1}^n V_i(x_i) + \sum_{i<j} b_{ij}\, x_i x_j$$

for some smooth functions $V_i : \mathbb{R} \to \mathbb{R}$ and some constants $b_{ij} \in \mathbb{R}$, which is often the case in statistical physics, then the matrix $B$ is independent of $x^n, y^n$, and has off-diagonal entries $b_{ij}$, $i \neq j$. Then (see Theorem 2 in [162]) the conditions of Theorem 43 will be satisfied, provided the following holds:

1. For each $i$ and each $\bar{x}^i \in \mathbb{R}^{n-1}$, the conditional probability distributions $P_{X_i|\bar{X}^i=\bar{x}^i}$ satisfy the Euclidean log-Sobolev inequality

$$D(Q\|P_{X_i|\bar{X}^i=\bar{x}^i}) \le \frac{c}{2}\, I(Q\|P_{X_i|\bar{X}^i=\bar{x}^i}),$$

where $I(\cdot\|\cdot)$ is the Fisher information distance, cf. (3.37) for the definition.

2. The operator norms of $B(x^n,y^n)$ are uniformly bounded as

$$\sup_{x^n, y^n} \|B(x^n,y^n)\|^2 \le \frac{1-\delta^2}{c^2}.$$

We also refer the reader to more recent follow-up work by Marton [164, 165], which further elaborates on the theme of studying the concentration properties of dependent random variables by focusing on the conditional probability distributions $P_{X_i|\bar{X}^i}$, $i = 1, \ldots, n$. These papers describe sufficient conditions on the joint distribution $P$ of $X_1, \ldots, X_n$, such that, for any other distribution $Q$,

$$D(Q\|P) \le K(P) \cdot D^-(Q\|P), \qquad (3.269)$$

where $D^-(\cdot\|\cdot)$ is the erasure divergence (cf. (3.22) for the definition), and the $P$-dependent constant $K(P) > 0$ is controlled by suitable contractivity properties of $P$. At this point, the utility of a tensorization inequality like (3.269) should be clear: each term in the erasure divergence

$$D^-(Q\|P) = \sum_{i=1}^n D\left(Q_{X_i|\bar{X}^i} \,\big\|\, P_{X_i|\bar{X}^i} \,\big|\, Q_{\bar{X}^i}\right)$$

can be handled by appealing to appropriate log-Sobolev inequalities or transportation-cost inequalities for probability measures on $\mathcal{X}$ (indeed, one can just treat $P_{X_i|\bar{X}^i=\bar{x}^i}$ for each fixed $\bar{x}^i$ as a probability measure on $\mathcal{X}$, in just the same way as with $P_{X_i}$ before), and then these "one-dimensional" bounds can be assembled together to derive concentration for the original "$n$-dimensional" distribution.

3.6 Applications in information theory and related topics 
3.6.1 The "blowing up" lemma and strong converses 

The first explicit invocation of the concentration of measure phenomenon in an information-theoretic context appears in the work of Ahlswede et al. [68, 69]. These authors have shown that the following result, now known as the "blowing up lemma" (see, e.g., [166, Lemma 1.5.4]), provides a versatile tool for proving strong converses in a variety of scenarios, including some multiterminal problems:

Lemma 16. For every two finite sets $\mathcal{X}$ and $\mathcal{Y}$ and every positive sequence $\varepsilon_n \to 0$, there exist positive sequences $\delta_n, \eta_n \to 0$, such that the following holds: For every discrete memoryless channel (DMC) with input alphabet $\mathcal{X}$, output alphabet $\mathcal{Y}$, and transition probabilities $T(y|x)$, $x \in \mathcal{X}$, $y \in \mathcal{Y}$, and every $n \in \mathbb{N}$, $x^n \in \mathcal{X}^n$, and $B \subseteq \mathcal{Y}^n$,

$$T^n(B|x^n) \ge \exp(-n\varepsilon_n) \quad \Longrightarrow \quad T^n(B_{n\delta_n}|x^n) \ge 1 - \eta_n. \qquad (3.270)$$

Here, for an arbitrary $B \subseteq \mathcal{Y}^n$ and $r > 0$, the set $B_r$ denotes the $r$-blowup of $B$ (see the definition in (3.168)) w.r.t. the Hamming metric

$$d_n(y^n,u^n) \triangleq \sum_{i=1}^n 1_{\{y_i \neq u_i\}}, \qquad \forall\, y^n, u^n \in \mathcal{Y}^n.$$



The proof of the blowing-up lemma, given in [68], was rather technical and made use of a very delicate isoperimetric inequality for discrete probability measures on a Hamming space, due to Margulis [167]. Later, the same result was obtained by Marton [70] using purely information-theoretic methods. We will use a sharper, "nonasymptotic" version of the blowing-up lemma, which is more in the spirit of the modern viewpoint on the concentration of measure (cf. Marton's follow-up paper [58]):



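Lemma 17. Let $X_1, \ldots, X_n$ be $n$ independent random variables taking values in a finite set $\mathcal{X}$. Then, for any $A \subseteq \mathcal{X}^n$ with $P_{X^n}(A) > 0$,

$$P_{X^n}(A_r) \ge 1 - \exp\left[-\frac{2}{n}\left(r - \sqrt{\frac{n}{2}\ln\frac{1}{P_{X^n}(A)}}\right)^2\right], \qquad \forall\, r > \sqrt{\frac{n}{2}\ln\frac{1}{P_{X^n}(A)}}. \qquad (3.271)$$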



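Proof. The proof of Lemma 17 is similar to the proof of Proposition 9, as is shown in the following: Consider the $L^1$ Wasserstein metric on $\mathcal{P}(\mathcal{X}^n)$ induced by the Hamming metric $d_n$ on $\mathcal{X}^n$, i.e., for any $P_n, Q_n \in \mathcal{P}(\mathcal{X}^n)$,

$$W_1(P_n,Q_n) \triangleq \inf_{X^n \sim P_n,\ Y^n \sim Q_n} \mathbb{E}\left[d_n(X^n,Y^n)\right] = \inf_{X^n \sim P_n,\ Y^n \sim Q_n} \mathbb{E}\left[\sum_{i=1}^n 1_{\{X_i \neq Y_i\}}\right] = \inf_{X^n \sim P_n,\ Y^n \sim Q_n} \sum_{i=1}^n \mathbb{P}(X_i \neq Y_i).$$

Let $P_n$ denote the product measure $P_{X^n} = P_{X_1} \otimes \cdots \otimes P_{X_n}$. By Pinsker's inequality, any $\mu \in \mathcal{P}(\mathcal{X})$ satisfies $T_1(1/4)$ on $(\mathcal{X},d)$ where $d = d_1$ is the Hamming metric. By Proposition the product measure $P_n$ satisfies $T_1(n/4)$ on the product space $(\mathcal{X}^n,d_n)$, i.e., for any $\mu_n \in \mathcal{P}(\mathcal{X}^n)$,

$$W_1(\mu_n,P_n) \le \sqrt{\frac{n}{2} D(\mu_n\|P_n)}. \qquad (3.272)$$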

For any set $C \subseteq \mathcal{X}^n$ with $P_n(C) > 0$, let $P_{n,C}$ denote the conditional probability measure $P_n(\cdot|C)$. Then, it follows that (see (3.203))

$$D(P_{n,C}\|P_n) = \ln\frac{1}{P_n(C)}. \qquad (3.273)$$

Now, given any $A \subseteq \mathcal{X}^n$ with $P_n(A) > 0$ and any $r > 0$, consider the probability measures $Q_n = P_{n,A}$ and $\bar{Q}_n = P_{n,A_r^c}$. Then

$$W_1(Q_n,\bar{Q}_n) \le W_1(Q_n,P_n) + W_1(\bar{Q}_n,P_n) \qquad (3.274)$$

$$\le \sqrt{\frac{n}{2} D(Q_n\|P_n)} + \sqrt{\frac{n}{2} D(\bar{Q}_n\|P_n)} \qquad (3.275)$$

$$= \sqrt{\frac{n}{2}\ln\frac{1}{P_n(A)}} + \sqrt{\frac{n}{2}\ln\frac{1}{1 - P_n(A_r)}} \qquad (3.276)$$

where (3.274) uses the triangle inequality, (3.275) follows from (3.272), and (3.276) uses (3.273). Following the same reasoning that leads to (3.205), it follows that

$$W_1(Q_n,\bar{Q}_n) = W_1(P_{n,A},P_{n,A_r^c}) \ge d_n(A,A_r^c) \ge r.$$

Using this to bound the left-hand side of (3.274) from below, we obtain (3.271). □



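We can now easily prove the blowing-up lemma (see Lemma 16). To this end, given a positive sequence $\{\varepsilon_n\}_{n=1}^\infty$ that tends to zero, let us choose a positive sequence $\{\delta_n\}_{n=1}^\infty$ such that

$$\delta_n > \sqrt{\frac{\varepsilon_n}{2}}, \qquad \delta_n \to 0, \qquad \eta_n \triangleq \exp\left(-2n\left(\delta_n - \sqrt{\frac{\varepsilon_n}{2}}\right)^2\right) \to 0 \quad \text{as } n \to \infty.$$

These requirements can be satisfied, e.g., by the setting

$$\delta_n \triangleq \sqrt{\frac{\varepsilon_n}{2}} + \sqrt{\frac{\alpha \ln n}{n}}, \qquad \eta_n = \frac{1}{n^{2\alpha}}, \qquad \forall\, n \in \mathbb{N},$$

where $\alpha > 0$ can be made arbitrarily small. Using this selection for $\{\delta_n\}_{n=1}^\infty$ in (3.271), we get (3.270) with the $r_n$-blowup of the set $B$ where $r_n = n\delta_n$. Note that the above selection does not depend on the transition probabilities of the DMC with input $\mathcal{X}$ and output $\mathcal{Y}$ (the correspondence between Lemmas 16 and 17 is given by $P_{X^n} = T^n(\cdot|x^n)$ where $x^n \in \mathcal{X}^n$ is arbitrary).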
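The passage from Lemma 17 to the blowing-up lemma can be sanity-checked numerically. The following is a minimal sketch, assuming the bound (3.271) in the form $1 - \exp[-(2/n)(r - \sqrt{(n/2)\ln(1/P(A))})^2]$ and the selection $\delta_n = \sqrt{\varepsilon_n/2} + \sqrt{\alpha \ln n / n}$; the function names are illustrative and not part of the original text.

```python
import math

def blowup_lower_bound(n, r, p_a):
    """Lower bound of Lemma 17 on P(A_r) when P(A) = p_a.
    Returns 0.0 (the trivial bound) when r does not exceed the threshold."""
    threshold = math.sqrt(0.5 * n * math.log(1.0 / p_a))
    if r <= threshold:
        return 0.0
    return 1.0 - math.exp(-(2.0 / n) * (r - threshold) ** 2)

def delta_eta(n, eps_n, alpha):
    """Sequences delta_n and eta_n from the suggested selection."""
    delta_n = math.sqrt(eps_n / 2.0) + math.sqrt(alpha * math.log(n) / n)
    eta_n = n ** (-2.0 * alpha)
    return delta_n, eta_n
```

With $P(A) = \exp(-n\varepsilon_n)$ and $r = n\delta_n$, the Lemma 17 bound evaluates to exactly $1 - \eta_n$, which is the statement (3.270).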

We are now ready to demonstrate how the blowing-up lemma can be used to obtain strong converses. Following [166], from this point on, we will use the notation $T : \mathcal{U} \to \mathcal{V}$ for a DMC with input alphabet $\mathcal{U}$, output alphabet $\mathcal{V}$, and transition probabilities $T(v|u)$, $u \in \mathcal{U}$, $v \in \mathcal{V}$.

We first consider the problem of characterizing the capacity region of a degraded broadcast channel (DBC). Let $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ be finite sets. A DBC is specified by a pair of DMC's $T_1 : \mathcal{X} \to \mathcal{Y}$ and $T_2 : \mathcal{X} \to \mathcal{Z}$ where there exists a DMC $T_3 : \mathcal{Y} \to \mathcal{Z}$ such that

$$T_2(z|x) = \sum_{y \in \mathcal{Y}} T_3(z|y)\, T_1(y|x), \qquad \forall\, x \in \mathcal{X},\ z \in \mathcal{Z}. \qquad (3.277)$$

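(More precisely, this is an instance of a stochastically degraded broadcast channel; see, e.g., [96, Section 5.6] and [168, Chapter 5].) Given $n, M_1, M_2 \in \mathbb{N}$, an $(n, M_1, M_2)$-code $C$ for the DBC $(T_1,T_2)$ consists of the following objects:

1. an encoding map $f_n : \{1, \ldots, M_1\} \times \{1, \ldots, M_2\} \to \mathcal{X}^n$;

2. a collection $\mathcal{D}_1$ of $M_1$ disjoint decoding sets $D_{1,i} \subseteq \mathcal{Y}^n$, $1 \le i \le M_1$; and, similarly,

3. a collection $\mathcal{D}_2$ of $M_2$ disjoint decoding sets $D_{2,j} \subseteq \mathcal{Z}^n$, $1 \le j \le M_2$.

Given $0 < \varepsilon_1, \varepsilon_2 \le 1$, we say that $C = (f_n, \mathcal{D}_1, \mathcal{D}_2)$ is an $(n, M_1, M_2, \varepsilon_1, \varepsilon_2)$-code if

$$\max_{1 \le i \le M_1}\ \max_{1 \le j \le M_2} T_1^n\left(D_{1,i}^c \,\big|\, f_n(i,j)\right) \le \varepsilon_1, \qquad \max_{1 \le i \le M_1}\ \max_{1 \le j \le M_2} T_2^n\left(D_{2,j}^c \,\big|\, f_n(i,j)\right) \le \varepsilon_2.$$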

In other words, we are using the maximal probability of error criterion. It should be noted that, although for some multiuser channels the capacity region w.r.t. the maximal probability of error is strictly smaller than the capacity region w.r.t. the average probability of error [169], these two capacity regions are identical for broadcast channels [170]. We say that a pair of rates $(R_1,R_2)$ (in nats per channel use) is $(\varepsilon_1,\varepsilon_2)$-achievable if for any $\delta > 0$ and sufficiently large $n$, there exists an $(n, M_1, M_2, \varepsilon_1, \varepsilon_2)$-code with

$$\frac{1}{n}\ln M_k \ge R_k - \delta, \qquad k = 1, 2.$$

Likewise, we say that $(R_1,R_2)$ is achievable if it is $(\varepsilon_1,\varepsilon_2)$-achievable for all $0 < \varepsilon_1, \varepsilon_2 \le 1$. Now let $\mathcal{R}(\varepsilon_1,\varepsilon_2)$ denote the set of all $(\varepsilon_1,\varepsilon_2)$-achievable rates, and let $\mathcal{R}$ denote the set of all achievable rates. Clearly,

$$\mathcal{R} = \bigcap_{(\varepsilon_1,\varepsilon_2) \in (0,1]^2} \mathcal{R}(\varepsilon_1,\varepsilon_2).$$

The following result was proved by Ahlswede and Körner [171]:

Theorem 44. A rate pair $(R_1,R_2)$ is achievable for the DBC $(T_1,T_2)$ if and only if there exist random variables $U \in \mathcal{U}$, $X \in \mathcal{X}$, $Y \in \mathcal{Y}$, $Z \in \mathcal{Z}$ such that $U \to X \to Y \to Z$ is a Markov chain, $P_{Y|X} = T_1$, $P_{Z|Y} = T_3$ (see (3.277)), and

$$R_1 \le I(X;Y|U), \qquad R_2 \le I(U;Z).$$

Moreover, the domain $\mathcal{U}$ of $U$ can be chosen so that $|\mathcal{U}| \le \min\{|\mathcal{X}|, |\mathcal{Y}|, |\mathcal{Z}|\}$.

The strong converse for the DBC, due to Ahlswede, Gács and Körner [68], states that allowing for nonvanishing probabilities of error does not enlarge the achievable region:

Theorem 45 (Strong converse for the DBC).

$$\mathcal{R}(\varepsilon_1,\varepsilon_2) = \mathcal{R}, \qquad \forall\, (\varepsilon_1,\varepsilon_2) \in (0,1]^2.$$

Before proceeding with the formal proof of this theorem, we briefly describe the way in which the 
blowing up lemma enters the picture. The main idea is that, given any code, one can "blow up" the 
decoding sets in such a way that the probability of decoding error can be as small as one desires (for 
large enough n). Of course, the blown-up decoding sets are no longer disjoint, so the resulting object is 
no longer a code according to the definition given earlier. On the other hand, the blowing-up operation 
transforms the original code into a list code with a subexponential list size, and one can use Fano's 
inequality to get nontrivial converse bounds. 

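Proof (Theorem 45). Let $C = (f_n, \tilde{\mathcal{D}}_1, \tilde{\mathcal{D}}_2)$ be an arbitrary $(n, M_1, M_2, \tilde{\varepsilon}_1, \tilde{\varepsilon}_2)$-code for the DBC $(T_1,T_2)$ with

$$\tilde{\mathcal{D}}_1 = \{\tilde{D}_{1,i}\}_{i=1}^{M_1} \qquad \text{and} \qquad \tilde{\mathcal{D}}_2 = \{\tilde{D}_{2,j}\}_{j=1}^{M_2}.$$

Let $\{\delta_n\}_{n=1}^\infty$ be a sequence of positive reals, such that

$$\delta_n \to 0, \qquad \sqrt{n}\,\delta_n \to \infty \qquad \text{as } n \to \infty.$$

For each $i \in \{1, \ldots, M_1\}$ and $j \in \{1, \ldots, M_2\}$, define the "blown-up" decoding sets

$$D_{1,i} \triangleq \left[\tilde{D}_{1,i}\right]_{n\delta_n} \qquad \text{and} \qquad D_{2,j} \triangleq \left[\tilde{D}_{2,j}\right]_{n\delta_n}.$$

By hypothesis, the decoding sets in $\tilde{\mathcal{D}}_1$ and $\tilde{\mathcal{D}}_2$ are such that

$$\min_{1 \le i \le M_1}\ \min_{1 \le j \le M_2} T_1^n\left(\tilde{D}_{1,i} \,\big|\, f_n(i,j)\right) \ge 1 - \tilde{\varepsilon}_1,$$

$$\min_{1 \le i \le M_1}\ \min_{1 \le j \le M_2} T_2^n\left(\tilde{D}_{2,j} \,\big|\, f_n(i,j)\right) \ge 1 - \tilde{\varepsilon}_2.$$

Therefore, by Lemma 17 we can find a sequence $\varepsilon_n \to 0$, such that

$$\min_{1 \le i \le M_1}\ \min_{1 \le j \le M_2} T_1^n\left(D_{1,i} \,\big|\, f_n(i,j)\right) \ge 1 - \varepsilon_n, \qquad (3.278a)$$

$$\min_{1 \le i \le M_1}\ \min_{1 \le j \le M_2} T_2^n\left(D_{2,j} \,\big|\, f_n(i,j)\right) \ge 1 - \varepsilon_n. \qquad (3.278b)$$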

Let $\mathcal{D}_1 = \{D_{1,i}\}_{i=1}^{M_1}$ and $\mathcal{D}_2 = \{D_{2,j}\}_{j=1}^{M_2}$. We have thus constructed a triple $(f_n, \mathcal{D}_1, \mathcal{D}_2)$ satisfying (3.278). Note, however, that this new object is not a code because the blown-up sets $D_{1,i} \subseteq \mathcal{Y}^n$ are not disjoint, and the same holds for the blown-up sets $\{D_{2,j}\}$. On the other hand, each given $n$-tuple $y^n \in \mathcal{Y}^n$ belongs to a small number of the $D_{1,i}$'s, and the same applies to the $D_{2,j}$'s. More precisely, let us define for each $y^n \in \mathcal{Y}^n$ the set

$$\mathcal{N}_1(y^n) \triangleq \{i : y^n \in D_{1,i}\},$$

and similarly for $\mathcal{N}_2(z^n)$, $z^n \in \mathcal{Z}^n$. Then a simple combinatorial argument (see [68, Lemma 5 and Eq. (37)] for details) can be used to show that there exists a sequence $\{\eta_n\}_{n=1}^\infty$ of positive reals, such that $\eta_n \to 0$ and

$$|\mathcal{N}_1(y^n)| \le |B_{n\delta_n}(y^n)| \le \exp(n\eta_n), \qquad \forall\, y^n \in \mathcal{Y}^n \qquad (3.279a)$$

$$|\mathcal{N}_2(z^n)| \le |B_{n\delta_n}(z^n)| \le \exp(n\eta_n), \qquad \forall\, z^n \in \mathcal{Z}^n \qquad (3.279b)$$

where, for any $y^n \in \mathcal{Y}^n$ and any $r \ge 0$, $B_r(y^n) \subseteq \mathcal{Y}^n$ denotes the ball of $d_n$-radius $r$ centered at $y^n$:

$$B_r(y^n) \triangleq \{v^n \in \mathcal{Y}^n : d_n(v^n,y^n) \le r\} \equiv \{y^n\}_r$$

(the last expression denotes the $r$-blowup of the singleton set $\{y^n\}$).
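The subexponential growth of the Hamming balls in (3.279) can be illustrated numerically. The sketch below is an illustration only: it counts the points of a Hamming ball over an alphabet of size `q` directly, and it uses the hypothetical choice $\delta_n = n^{-1/4}$, which satisfies $\delta_n \to 0$ and $\sqrt{n}\,\delta_n \to \infty$; neither the function names nor this particular sequence come from the original text.

```python
import math

def hamming_ball_size(n, r, q):
    """Number of length-n sequences over a q-ary alphabet within Hamming
    distance r of a fixed center: sum_{k <= r} C(n, k) * (q - 1)^k."""
    radius = min(int(math.floor(r)), n)
    return sum(math.comb(n, k) * (q - 1) ** k for k in range(radius + 1))

def ball_exponent(n, q):
    """Normalized log-size (1/n) ln |B_{n delta_n}| for delta_n = n^(-1/4)."""
    delta_n = n ** (-0.25)
    return math.log(hamming_ball_size(n, n * delta_n, q)) / n
```

Evaluating `ball_exponent` for increasing `n` shows the normalized log-size shrinking, consistent with the existence of a sequence $\eta_n \to 0$ in (3.279).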

We are now ready to apply Fano's inequality, just as in [171]. Specifically, let $U$ have a uniform distribution over $\{1, \ldots, M_2\}$, and let $X^n \in \mathcal{X}^n$ have a uniform distribution over the set $\mathcal{T}(U)$, where for each $j \in \{1, \ldots, M_2\}$ we let

$$\mathcal{T}(j) \triangleq \{f_n(i,j) : 1 \le i \le M_1\}.$$

Finally, let $Y^n \in \mathcal{Y}^n$ and $Z^n \in \mathcal{Z}^n$ be generated from $X^n$ via the DMC's $T_1^n$ and $T_2^n$, respectively. Now, for each $z^n \in \mathcal{Z}^n$, consider the error event

$$E_n(z^n) \triangleq \{U \notin \mathcal{N}_2(z^n)\}, \qquad \forall\, z^n \in \mathcal{Z}^n$$

and let $\zeta_n \triangleq \mathbb{P}(E_n(Z^n))$. Then, using a modification of Fano's inequality for list decoding (see Appendix 3.C) together with (3.279), we get

$$H(U|Z^n) \le h(\zeta_n) + (1 - \zeta_n)\, n\eta_n + \zeta_n \ln M_2. \qquad (3.280)$$

On the other hand, $\ln M_2 = H(U) = I(U;Z^n) + H(U|Z^n)$, so

$$\frac{1}{n}\ln M_2 \le \frac{1}{n}\left[I(U;Z^n) + h(\zeta_n) + \zeta_n \ln M_2\right] + (1 - \zeta_n)\eta_n = \frac{1}{n} I(U;Z^n) + o(1),$$

where the second step uses the fact that, by (3.278), $\zeta_n \le \varepsilon_n$, which converges to zero. Using a similar argument, we can also prove that

$$\frac{1}{n}\ln M_1 \le \frac{1}{n} I(X^n;Y^n|U) + o(1).$$

By the weak converse for the DBC [171], the pair $(R_1,R_2)$ with $R_1 = \frac{1}{n} I(X^n;Y^n|U)$ and $R_2 = \frac{1}{n} I(U;Z^n)$ belongs to the achievable region $\mathcal{R}$. Since any element of $\mathcal{R}(\varepsilon_1,\varepsilon_2)$ can be expressed as a limit of rates $\left(\frac{1}{n}\ln M_1, \frac{1}{n}\ln M_2\right)$, and since the achievable region $\mathcal{R}$ is closed, we conclude that $\mathcal{R}(\varepsilon_1,\varepsilon_2) \subseteq \mathcal{R}$ for all $\varepsilon_1, \varepsilon_2 \in (0,1]$, and Theorem 45 is proved. □



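Our second example of the use of the blowing-up lemma to prove a strong converse is a bit more sophisticated, and concerns the problem of lossless source coding with side information. Let $\mathcal{X}$ and $\mathcal{Y}$ be finite sets, and $\{(X_i,Y_i)\}_{i=1}^\infty$ be a sequence of i.i.d. samples drawn from a given joint distribution $P_{XY} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$. The $\mathcal{X}$-valued and the $\mathcal{Y}$-valued parts of this sequence are observed by two independent encoders. An $(n, M_1, M_2)$-code is a triple $C = (f_n^{(1)}, f_n^{(2)}, g_n)$, where $f_n^{(1)} : \mathcal{X}^n \to \{1, \ldots, M_1\}$ and $f_n^{(2)} : \mathcal{Y}^n \to \{1, \ldots, M_2\}$ are the encoding maps and $g_n : \{1, \ldots, M_1\} \times \{1, \ldots, M_2\} \to \mathcal{Y}^n$ is the decoding map. The decoder observes

$$J_n^{(1)} = f_n^{(1)}(X^n) \qquad \text{and} \qquad J_n^{(2)} = f_n^{(2)}(Y^n)$$

and wishes to reconstruct $Y^n$ with a small probability of error. The reconstruction is given by

$$\hat{Y}^n = g_n\left(J_n^{(1)}, J_n^{(2)}\right) = g_n\left(f_n^{(1)}(X^n), f_n^{(2)}(Y^n)\right).$$

We say that $C = (f_n^{(1)}, f_n^{(2)}, g_n)$ is an $(n, M_1, M_2, \varepsilon)$-code if

$$\mathbb{P}\left(\hat{Y}^n \neq Y^n\right) = \mathbb{P}\left(g_n\left(f_n^{(1)}(X^n), f_n^{(2)}(Y^n)\right) \neq Y^n\right) \le \varepsilon. \qquad (3.281)$$

We say that a rate pair $(R_1,R_2)$ is $\varepsilon$-achievable if, for any $\delta > 0$ and sufficiently large $n \in \mathbb{N}$, there exists an $(n, M_1, M_2, \varepsilon)$-code $C$ with

$$\frac{1}{n}\ln M_k \le R_k + \delta, \qquad k = 1, 2. \qquad (3.282)$$

A rate pair $(R_1,R_2)$ is achievable if it is $\varepsilon$-achievable for all $\varepsilon \in (0,1]$. Again, let $\mathcal{R}(\varepsilon)$ (resp., $\mathcal{R}$) denote the set of all $\varepsilon$-achievable (resp., achievable) rate pairs. Clearly,

$$\mathcal{R} = \bigcap_{\varepsilon \in (0,1]} \mathcal{R}(\varepsilon).$$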
The following characterization of the achievable region was obtained in [171]:

Theorem 46. A rate pair $(R_1,R_2)$ is achievable if and only if there exist random variables $U \in \mathcal{U}$, $X \in \mathcal{X}$, $Y \in \mathcal{Y}$, such that $U \to X \to Y$ is a Markov chain, $(X,Y)$ has the given joint distribution $P_{XY}$, and

$$R_1 \ge I(X;U), \qquad R_2 \ge H(Y|U).$$

Moreover, the domain $\mathcal{U}$ of $U$ can be chosen so that $|\mathcal{U}| \le |\mathcal{X}| + 2$.

Our goal is to prove the corresponding strong converse (originally established in [68]), which states that allowing for a nonvanishing error probability, as in (3.281), does not asymptotically enlarge the achievable region:

Theorem 47 (Strong converse for source coding with side information).

$$\mathcal{R}(\varepsilon) = \mathcal{R}, \qquad \forall\, \varepsilon \in (0,1].$$

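In preparation for the proof of Theorem 47, we need to introduce some additional terminology and definitions. Given two finite sets $\mathcal{U}$ and $\mathcal{V}$, a DMC $S : \mathcal{U} \to \mathcal{V}$, and a parameter $\eta \in [0,1]$, we say, following [166], that a set $B \subseteq \mathcal{V}$ is an $\eta$-image of $u \in \mathcal{U}$ under $S$ if $S(B|u) \ge \eta$. For any $B \subseteq \mathcal{V}$, let $\mathcal{D}_\eta(B;S) \subseteq \mathcal{U}$ denote the set of all $u \in \mathcal{U}$, such that $B$ is an $\eta$-image of $u$ under $S$:

$$\mathcal{D}_\eta(B;S) \triangleq \{u \in \mathcal{U} : S(B|u) \ge \eta\}.$$

Now, given $P_{XY} \in \mathcal{P}(\mathcal{X} \times \mathcal{Y})$, let $T : \mathcal{X} \to \mathcal{Y}$ be the DMC corresponding to the conditional probability distribution $P_{Y|X}$. Finally, given a strictly positive probability measure $Q_Y \in \mathcal{P}(\mathcal{Y})$ and the parameters $c \ge 0$ and $\varepsilon \in (0,1]$, we define

$$\widehat{\Gamma}_n(c,\varepsilon;Q_Y) \triangleq \min_{B \subseteq \mathcal{Y}^n} \left\{ \frac{1}{n}\ln Q_Y^n(B) \,:\, \frac{1}{n}\ln P_X^n\left(\mathcal{D}_{1-\varepsilon}(B;T^n) \cap \mathcal{T}^n_{[X]}\right) \ge -c \right\} \qquad (3.283)$$

where $\mathcal{T}^n_{[X]} \subseteq \mathcal{X}^n$ denotes the typical set induced by the marginal distribution $P_X$.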

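Theorem 48. For any $c \ge 0$ and any $\varepsilon \in (0,1]$,

$$\lim_{n \to \infty} \widehat{\Gamma}_n(c,\varepsilon;Q_Y) = \Gamma(c;Q_Y), \qquad (3.284)$$

where

$$\Gamma(c;Q_Y) \triangleq -\max_{\mathcal{U} : |\mathcal{U}| \le |\mathcal{X}|+2}\ \max_{U \in \mathcal{U}} \left\{ D(P_{Y|U}\|Q_Y|P_U) \,:\, U \to X \to Y;\ I(X;U) \le c \right\}. \qquad (3.285)$$

Moreover, the function $c \mapsto \Gamma(c;Q_Y)$ is continuous.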

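Proof. The proof consists of two major steps. The first is to show that (3.284) holds for $\varepsilon = 0$, and that the limit $\Gamma(c;Q_Y)$ is equal to (3.285). We omit the details of this step and instead refer the reader to the original paper by Ahlswede, Gács and Körner [68]. The second step, which actually relies on the blowing-up lemma, is to show that

$$\lim_{n \to \infty} \left[\widehat{\Gamma}_n(c,\varepsilon;Q_Y) - \widehat{\Gamma}_n(c,\varepsilon';Q_Y)\right] = 0 \qquad (3.286)$$

for any $\varepsilon, \varepsilon' \in (0,1]$. To that end, let us fix an $\varepsilon$ and choose a sequence of positive reals, such that

$$\delta_n \to 0 \qquad \text{and} \qquad \sqrt{n}\,\delta_n \to \infty \qquad \text{as } n \to \infty. \qquad (3.287)$$

For a fixed $n$, let us consider any set $B \subseteq \mathcal{Y}^n$. If $T^n(B|x^n) \ge 1 - \varepsilon$ for some $x^n \in \mathcal{X}^n$, then by Lemma 17

$$T^n(B_{n\delta_n}|x^n) \ge 1 - \exp\left[-\frac{2}{n}\left(n\delta_n - \sqrt{\frac{n}{2}\ln\frac{1}{1-\varepsilon}}\right)^2\right] = 1 - \exp\left[-2\left(\sqrt{n}\,\delta_n - \sqrt{\frac{1}{2}\ln\frac{1}{1-\varepsilon}}\right)^2\right] \triangleq 1 - \varepsilon_n. \qquad (3.288)$$

Owing to (3.287), the right-hand side of (3.288) will tend to 1 as $n \to \infty$, which implies that, for all large $n$,

$$\mathcal{D}_{1-\varepsilon_n}(B_{n\delta_n};T^n) \cap \mathcal{T}^n_{[X]} \supseteq \mathcal{D}_{1-\varepsilon}(B;T^n) \cap \mathcal{T}^n_{[X]}. \qquad (3.289)$$

On the other hand, since $Q_Y$ is strictly positive,

$$Q_Y^n(B_{n\delta_n}) = \sum_{y^n \in B_{n\delta_n}} Q_Y^n(y^n) \le \sum_{y^n \in B} Q_Y^n\left(B_{n\delta_n}(y^n)\right) \le \sup_{y^n \in \mathcal{Y}^n} \frac{Q_Y^n\left(B_{n\delta_n}(y^n)\right)}{Q_Y^n(y^n)} \sum_{y^n \in B} Q_Y^n(y^n) = \sup_{y^n \in \mathcal{Y}^n} \frac{Q_Y^n\left(B_{n\delta_n}(y^n)\right)}{Q_Y^n(y^n)} \cdot Q_Y^n(B).$$

Using this together with the fact that

$$\lim_{n \to \infty} \frac{1}{n}\ln \sup_{y^n \in \mathcal{Y}^n} \frac{Q_Y^n\left(B_{n\delta_n}(y^n)\right)}{Q_Y^n(y^n)} = 0$$

(see [68, Lemma 5]), we can write

$$\lim_{n \to \infty}\ \sup_{B \subseteq \mathcal{Y}^n} \frac{1}{n}\ln\frac{Q_Y^n(B_{n\delta_n})}{Q_Y^n(B)} = 0. \qquad (3.290)$$

From (3.289) and (3.290), it follows that

$$\lim_{n \to \infty} \left[\widehat{\Gamma}_n(c,\varepsilon;Q_Y) - \widehat{\Gamma}_n(c,\varepsilon_n;Q_Y)\right] = 0.$$

This completes the proof of Theorem 48. □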



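We are now ready to prove Theorem 47. Let $C = (f_n^{(1)}, f_n^{(2)}, g_n)$ be an arbitrary $(n, M_1, M_2, \varepsilon)$-code. For a given index $j \in \{1, \ldots, M_1\}$, we define the set

$$B(j) \triangleq \left\{y^n \in \mathcal{Y}^n : y^n = g_n\left(j, f_n^{(2)}(y^n)\right)\right\},$$

which consists of all $y^n \in \mathcal{Y}^n$ that are correctly decoded for any $x^n \in \mathcal{X}^n$ such that $f_n^{(1)}(x^n) = j$. Using this notation, we can write

$$\mathbb{E}\left[T^n\left(B(f_n^{(1)}(X^n)) \,\big|\, X^n\right)\right] \ge 1 - \varepsilon. \qquad (3.291)$$

If we define the set

$$A_n \triangleq \left\{x^n \in \mathcal{X}^n : T^n\left(B(f_n^{(1)}(x^n)) \,\big|\, x^n\right) \ge 1 - \sqrt{\varepsilon}\right\},$$

then, using the so-called "reverse Markov inequality" and (3.291), we see that

$$P_X^n(A_n) = 1 - P_X^n(A_n^c) = 1 - P_X^n\left(T^n\left(B(f_n^{(1)}(X^n)) \,\big|\, X^n\right) < 1 - \sqrt{\varepsilon}\right)$$

$$\ge 1 - \frac{1 - \mathbb{E}\left[T^n\left(B(f_n^{(1)}(X^n)) \,\big|\, X^n\right)\right]}{1 - (1 - \sqrt{\varepsilon})} \ge 1 - \frac{1 - (1 - \varepsilon)}{\sqrt{\varepsilon}} = 1 - \sqrt{\varepsilon}.$$

Consequently, for all sufficiently large $n$, we have

$$P_X^n\left(A_n \cap \mathcal{T}^n_{[X]}\right) \ge 1 - 2\sqrt{\varepsilon}.$$

This implies, in turn, that there exists some $j^* \in f_n^{(1)}(\mathcal{X}^n)$, such that

$$P_X^n\left(\mathcal{D}_{1-\sqrt{\varepsilon}}(B(j^*);T^n) \cap \mathcal{T}^n_{[X]}\right) \ge \frac{1 - 2\sqrt{\varepsilon}}{M_1}. \qquad (3.292)$$

On the other hand,

$$M_2 = \left|f_n^{(2)}(\mathcal{Y}^n)\right| \ge |B(j^*)|. \qquad (3.293)$$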

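We are now in a position to apply Theorem 48. If we choose $Q_Y$ to be the uniform distribution on $\mathcal{Y}$, then it follows from (3.292) and (3.293) that

$$\frac{1}{n}\ln M_2 \ge \frac{1}{n}\ln|B(j^*)| = \frac{1}{n}\ln Q_Y^n(B(j^*)) + \ln|\mathcal{Y}| \ge \widehat{\Gamma}_n\left(-\frac{1}{n}\ln(1 - 2\sqrt{\varepsilon}) + \frac{1}{n}\ln M_1,\ \sqrt{\varepsilon};\ Q_Y\right) + \ln|\mathcal{Y}|.$$

Using Theorem 48, we conclude that the bound

$$\frac{1}{n}\ln M_2 \ge \Gamma\left(-\frac{1}{n}\ln(1 - 2\sqrt{\varepsilon}) + \frac{1}{n}\ln M_1;\ Q_Y\right) + \ln|\mathcal{Y}| + o(1) \qquad (3.294)$$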



holds for any $(n, M_1, M_2, \varepsilon)$-code. If $(R_1,R_2) \in \mathcal{R}(\varepsilon)$, then there exists a sequence $\{C_n\}_{n=1}^\infty$, where each $C_n = (f_n^{(1)}, f_n^{(2)}, g_n)$ is an $(n, M_{1,n}, M_{2,n}, \varepsilon)$-code, and

$$\lim_{n \to \infty} \frac{1}{n}\ln M_{k,n} = R_k, \qquad k = 1, 2.$$

[Footnote: The reverse Markov inequality states that if $Y$ is a random variable such that $Y \le b$ a.s. for some constant $b$, then for all $a < b$, $\mathbb{P}(Y \le a) \le \frac{b - \mathbb{E}[Y]}{b - a}$.]

Using this in (3.294), together with the continuity of the mapping $c \mapsto \Gamma(c;Q_Y)$, we get

$$R_2 \ge \Gamma(R_1;Q_Y) + \ln|\mathcal{Y}|, \qquad \forall\, (R_1,R_2) \in \mathcal{R}(\varepsilon). \qquad (3.295)$$

By definition of $\Gamma$ in (3.285), there exists a triple $U \to X \to Y$ such that $I(X;U) \le R_1$ and

$$\Gamma(R_1;Q_Y) = -D(P_{Y|U}\|Q_Y|P_U) = -\ln|\mathcal{Y}| + H(Y|U), \qquad (3.296)$$

where the second equality is due to the fact that $U \to X \to Y$ is a Markov chain and $Q_Y$ is the uniform distribution on $\mathcal{Y}$. Therefore, (3.295) and (3.296) imply that

$$R_2 \ge H(Y|U).$$

Consequently, the pair $(R_1,R_2)$ belongs to $\mathcal{R}$ by Theorem 46, and hence $\mathcal{R}(\varepsilon) \subseteq \mathcal{R}$ for all $\varepsilon > 0$. Since $\mathcal{R} \subseteq \mathcal{R}(\varepsilon)$ by definition, the proof of Theorem 47 is completed.
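The second equality in (3.296) rests on the identity $D(P_{Y|U}\|Q_Y|P_U) = \ln|\mathcal{Y}| - H(Y|U)$ for a uniform $Q_Y$. A small numerical check is sketched below; the distributions in the usage note are made up for illustration, and the function names are hypothetical.

```python
import math

def cond_divergence_to_uniform(p_u, p_y_given_u):
    """D(P_{Y|U} || Q_Y | P_U) in nats, with Q_Y uniform on the output alphabet."""
    size_y = len(p_y_given_u[0])
    return sum(pu * sum(p * math.log(p * size_y) for p in row if p > 0.0)
               for pu, row in zip(p_u, p_y_given_u))

def cond_entropy(p_u, p_y_given_u):
    """Conditional entropy H(Y|U) in nats."""
    return -sum(pu * sum(p * math.log(p) for p in row if p > 0.0)
                for pu, row in zip(p_u, p_y_given_u))
```

For instance, with `p_u = [0.3, 0.7]` and rows `[0.5, 0.25, 0.25]` and `[0.1, 0.6, 0.3]`, the two sides of the identity agree up to floating-point error.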

3.6.2 The empirical distribution of good channel codes with non-vanishing error probability

A more recent application of concentration of measure to information theory has to do with characterizing stochastic behavior of output sequences of good channel codes. On a conceptual level, the random coding argument originally used by Shannon, and many times since, to show the existence of good channel codes suggests that the input (resp., output) sequence of such a code should resemble, as much as possible, a typical realization of a sequence of i.i.d. random variables sampled from a capacity-achieving input (resp., output) distribution. For capacity-achieving sequences of codes with asymptotically vanishing probability of error, this intuition has been analyzed rigorously by Shamai and Verdú [172], who have proved the following remarkable statement [172, Theorem 2]: given a DMC $T : \mathcal{X} \to \mathcal{Y}$, any capacity-achieving sequence of channel codes with asymptotically vanishing probability of error (maximal or average) has the property that

$$\lim_{n \to \infty} \frac{1}{n} D(P_{Y^n}\|P^*_{Y^n}) = 0, \qquad (3.297)$$

where, for each $n$, $P_{Y^n}$ denotes the output distribution on $\mathcal{Y}^n$ induced by the code (assuming the messages are equiprobable), while $P^*_{Y^n}$ is the product of $n$ copies of the single-letter capacity-achieving output distribution (see below for a more detailed exposition). In fact, the convergence in (3.297) holds not just for DMC's, but for arbitrary channels satisfying the condition

$$C = \lim_{n \to \infty} \frac{1}{n} \sup_{P_{X^n} \in \mathcal{P}(\mathcal{X}^n)} I(X^n;Y^n).$$

In a recent preprint [173], Polyanskiy and Verdú extended the results of [172] for codes with nonvanishing probability of error, provided one uses the maximal probability of error criterion and deterministic encoders.

In this section, we will present some of the results from [173] in the context of the material covered earlier in this chapter. To keep things simple, we will only focus on channels with finite input and output alphabets. Thus, let $\mathcal{X}$ and $\mathcal{Y}$ be finite sets, and consider a DMC $T : \mathcal{X} \to \mathcal{Y}$. The capacity $C$ is given by solving the optimization problem

$$C = \max_{P_X \in \mathcal{P}(\mathcal{X})} I(X;Y),$$

where $X$ and $Y$ are related via $T$. Let $P^*_X \in \mathcal{P}(\mathcal{X})$ be any capacity-achieving input distribution (there may be several). It can be shown ([174, 175]) that the corresponding output distribution $P^*_Y \in \mathcal{P}(\mathcal{Y})$ is unique, and that for any $n \in \mathbb{N}$, the product distribution $P^*_{Y^n} \triangleq (P^*_Y)^{\otimes n}$ has the key property

$$D(P_{Y^n|X^n=x^n}\|P^*_{Y^n}) \le nC, \qquad \forall\, x^n \in \mathcal{X}^n \qquad (3.298)$$

where $P_{Y^n|X^n=x^n}$ is shorthand for the product distribution $T^n(\cdot|x^n)$. From the bound (3.298), we see that the capacity-achieving output distribution $P^*_{Y^n}$ dominates any output distribution $P_{Y^n}$ induced by an arbitrary input distribution $P_{X^n} \in \mathcal{P}(\mathcal{X}^n)$:

$$P_{Y^n|X^n=x^n} \ll P^*_{Y^n},\ \forall\, x^n \in \mathcal{X}^n \quad \Longrightarrow \quad P_{Y^n} \ll P^*_{Y^n},\ \forall\, P_{X^n} \in \mathcal{P}(\mathcal{X}^n).$$

This has two important consequences:

1. The information density is well-defined for any $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$:

$$i^*_{X^n;Y^n}(x^n;y^n) \triangleq \ln\frac{dP_{Y^n|X^n=x^n}}{dP^*_{Y^n}}(y^n).$$

2. For any input distribution $P_{X^n}$, the corresponding output distribution $P_{Y^n}$ satisfies

$$D(P_{Y^n}\|P^*_{Y^n}) \le nC - I(X^n;Y^n).$$

Indeed, by the chain rule for divergence, it follows that for any input distribution $P_{X^n} \in \mathcal{P}(\mathcal{X}^n)$

$$I(X^n;Y^n) = D(P_{Y^n|X^n}\|P_{Y^n}|P_{X^n}) = D(P_{Y^n|X^n}\|P^*_{Y^n}|P_{X^n}) - D(P_{Y^n}\|P^*_{Y^n}) \le nC - D(P_{Y^n}\|P^*_{Y^n}).$$

The claimed bound follows upon rearranging this inequality.
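The chain-rule identity and the key property (3.298) can be checked on a concrete channel. The sketch below uses a binary symmetric channel with $n = 1$, for which the capacity-achieving output distribution is uniform and $C = \ln 2 - h(p)$; the channel, the input distribution and all function names are illustrative choices rather than part of the original text.

```python
import math

def kl(p, q):
    """Relative entropy D(p || q) in nats for finite distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def bsc_quantities(crossover, input_prob_one):
    """Return (C, I(X;Y), D(P_{Y|X} || P*_Y | P_X), D(P_Y || P*_Y)) for a BSC
    with crossover probability strictly between 0 and 1."""
    rows = [[1.0 - crossover, crossover], [crossover, 1.0 - crossover]]  # P_{Y|X=x}
    p_x = [1.0 - input_prob_one, input_prob_one]
    p_y = [sum(p_x[x] * rows[x][y] for x in range(2)) for y in range(2)]
    p_star = [0.5, 0.5]  # capacity-achieving output distribution of a BSC
    binary_entropy = -(crossover * math.log(crossover)
                       + (1.0 - crossover) * math.log(1.0 - crossover))
    capacity = math.log(2.0) - binary_entropy
    mutual_info = sum(p_x[x] * kl(rows[x], p_y) for x in range(2))
    cond_div = sum(p_x[x] * kl(rows[x], p_star) for x in range(2))
    out_div = kl(p_y, p_star)
    return capacity, mutual_info, cond_div, out_div
```

For any input distribution, the returned values satisfy $I(X;Y) = D(P_{Y|X}\|P^*_Y|P_X) - D(P_Y\|P^*_Y)$ and $D(P_Y\|P^*_Y) \le C - I(X;Y)$ up to rounding; for the BSC the conditional divergence equals $C$ for every input.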

Now let us bring codes into the picture. Given $n, M \in \mathbb{N}$, an $(n,M)$-code for $T$ is a pair $C = (f_n,g_n)$ consisting of an encoding map $f_n : \{1, \ldots, M\} \to \mathcal{X}^n$ and a decoding map $g_n : \mathcal{Y}^n \to \{1, \ldots, M\}$. Given $0 < \varepsilon \le 1$, we say that $C$ is an $(n, M, \varepsilon)$-code if

$$\max_{1 \le i \le M} \mathbb{P}\left(g_n(Y^n) \neq i \,\big|\, X^n = f_n(i)\right) \le \varepsilon. \qquad (3.299)$$

Remark 49. Polyanskiy and Verdú [173] use a more precise nomenclature and say that any such $C = (f_n,g_n)$ satisfying (3.299) is an $(n, M, \varepsilon)_{\rm max,det}$-code to indicate explicitly that the encoding map $f_n$ is deterministic and that the maximal probability of error criterion is used. Here, we will only consider codes of this type, so we will adhere to our simplified terminology.

Consider any $(n,M)$-code $C = (f_n,g_n)$ for $T$, and let $J$ be a random variable uniformly distributed on $\{1, \ldots, M\}$. Hence, we can think of any $1 \le i \le M$ as one of $M$ equiprobable messages to be transmitted over $T$. Let $P^{(C)}_{X^n}$ denote the distribution of $X^n = f_n(J)$, and let $P^{(C)}_{Y^n}$ denote the corresponding output distribution. The central result of [173] is that the output distribution $P^{(C)}_{Y^n}$ of any $(n, M, \varepsilon)$-code satisfies

$$D\left(P^{(C)}_{Y^n}\|P^*_{Y^n}\right) \le nC - \ln M + o(n); \qquad (3.300)$$

moreover, the $o(n)$ term was refined in [173, Theorem 5] to $O(\sqrt{n})$ for any DMC, except those that have zeroes in their transition matrix. In the following, we present a sharpened bound with a modified proof, in which we specify an explicit form for the term that scales like $O(\sqrt{n})$.

Just as in [173], the proof of (3.300) with the $O(\sqrt{n})$ term uses the following strong converse for channel codes due to Augustin [176] (see also [173, Theorem 1] and [177, Section 2]):

Theorem 49 (Augustin). Let $S : \mathcal{U} \to \mathcal{V}$ be a DMC with finite input and output alphabets, and let $P_{V|U}$ be the transition probability induced by $S$. For any $M \in \mathbb{N}$ and $0 < \varepsilon \le 1$, let $f : \{1, \ldots, M\} \to \mathcal{U}$ and $g : \mathcal{V} \to \{1, \ldots, M\}$ be two mappings, such that

$$\max_{1 \le i \le M} \mathbb{P}\left(g(V) \neq i \,\big|\, U = f(i)\right) \le \varepsilon.$$

Let $Q_V \in \mathcal{P}(\mathcal{V})$ be an auxiliary output distribution, and fix an arbitrary mapping $\gamma : \mathcal{U} \to \mathbb{R}$. Then, the following inequality holds:

$$M \le \frac{\exp\{\mathbb{E}[\gamma(U)]\}}{\displaystyle \inf_{u \in \mathcal{U}} P_{V|U=u}\left(\ln\frac{dP_{V|U=u}}{dQ_V} < \gamma(u)\right) - \varepsilon}, \qquad (3.301)$$

provided the denominator is strictly positive. The expectation in the numerator is taken w.r.t. the distribution of $U = f(J)$ with $J \sim \mathrm{Uniform}\{1, \ldots, M\}$.



We first establish the bound (3.300) for the case when the DMC $T$ is such that

$$C_1 \triangleq \max_{x,x' \in \mathcal{X}} D(P_{Y|X=x}\|P_{Y|X=x'}) < \infty. \qquad (3.302)$$

Note that $C_1 < \infty$ if and only if the transition matrix of $T$ does not have any zeroes. Consequently,

$$c(T) \triangleq 2 \max_{x \in \mathcal{X}}\ \max_{y,y' \in \mathcal{Y}} \left|\ln\frac{P_{Y|X}(y|x)}{P_{Y|X}(y'|x)}\right| < \infty. \qquad (3.303)$$

We can now establish the following sharpened version of the bound in [173, Theorem 5]:

Theorem 50. Let $T : \mathcal{X} \to \mathcal{Y}$ be a DMC with $C > 0$ satisfying (3.302). Then, any $(n, M, \varepsilon)$-code $C$ for $T$ with $0 < \varepsilon < 1/2$ satisfies

$$D\left(P^{(C)}_{Y^n}\|P^*_{Y^n}\right) \le nC - \ln M + \ln\frac{1}{\varepsilon} + c(T)\sqrt{\frac{n}{2}\ln\frac{1}{1 - 2\varepsilon}}. \qquad (3.304)$$

Remark 50. As shown in [173], the restriction to codes with deterministic encoders and to the maximal probability of error criterion is necessary both for this theorem and for the next one.
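To get a feel for the size of the correction terms in (3.304), one can evaluate $c(T)$ and the right-hand side for a specific channel. The sketch below assumes the definition of $c(T)$ in (3.303) as twice the largest absolute log-ratio between two entries in the same row of the transition matrix; the function names and the numerical inputs in the usage note are illustrative.

```python
import math

def channel_constant(transition):
    """c(T): twice the maximum over inputs x and output pairs (y, y') of
    |ln(T(y|x) / T(y'|x))|. All entries must be strictly positive."""
    return 2.0 * max(abs(math.log(row[y] / row[yp]))
                     for row in transition
                     for y in range(len(row))
                     for yp in range(len(row)))

def theorem50_rhs(n, num_messages, eps, capacity, c_t):
    """Right-hand side of (3.304), in nats; requires 0 < eps < 1/2."""
    if not 0.0 < eps < 0.5:
        raise ValueError("eps must lie strictly between 0 and 1/2")
    return (n * capacity - math.log(num_messages) + math.log(1.0 / eps)
            + c_t * math.sqrt(0.5 * n * math.log(1.0 / (1.0 - 2.0 * eps))))
```

For a binary symmetric channel with crossover probability 0.1, `channel_constant([[0.9, 0.1], [0.1, 0.9]])` equals $2\ln 9$, and the last term of the bound grows like $\sqrt{n}$: quadrupling `n` doubles it.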



Proof. Fix an input sequence $x^n \in \mathcal{X}^n$ and consider the function $h_{x^n} : \mathcal{Y}^n \to \mathbb{R}$ defined by

$$h_{x^n}(y^n) \triangleq \ln\frac{dP_{Y^n|X^n=x^n}}{dP^{(C)}_{Y^n}}(y^n).$$

Then $\mathbb{E}[h_{x^n}(Y^n)|X^n = x^n] = D\left(P_{Y^n|X^n=x^n}\|P^{(C)}_{Y^n}\right)$. Moreover, for any $i \in \{1, \ldots, n\}$, $y, y' \in \mathcal{Y}$, and $\bar{y}^i \in \mathcal{Y}^{n-1}$, we have (see the notation used in (3.24))

$$\left|h_{i,x^n}(y|\bar{y}^i) - h_{i,x^n}(y'|\bar{y}^i)\right| \le \left|\ln P_{Y^n|X^n=x^n}(y^{i-1},y,y_{i+1}^n) - \ln P_{Y^n|X^n=x^n}(y^{i-1},y',y_{i+1}^n)\right| + \left|\ln P^{(C)}_{Y^n}(y^{i-1},y,y_{i+1}^n) - \ln P^{(C)}_{Y^n}(y^{i-1},y',y_{i+1}^n)\right|$$

$$\le \left|\ln\frac{P_{Y_i|X_i=x_i}(y)}{P_{Y_i|X_i=x_i}(y')}\right| + \left|\ln\frac{P^{(C)}_{Y_i|\bar{Y}^i}(y|\bar{y}^i)}{P^{(C)}_{Y_i|\bar{Y}^i}(y'|\bar{y}^i)}\right|$$

$$\le 2 \max_{x \in \mathcal{X}}\ \max_{y,y' \in \mathcal{Y}} \left|\ln\frac{P_{Y|X}(y|x)}{P_{Y|X}(y'|x)}\right| \qquad (3.305)$$

$$= c(T) < \infty \qquad (3.306)$$

(see Appendix 3.D for a detailed explanation of the inequality in (3.305)). Hence, for each fixed $x^n \in \mathcal{X}^n$, the function $h_{x^n} : \mathcal{Y}^n \to \mathbb{R}$ satisfies the bounded differences condition (3.138) with $c_1 = \ldots = c_n = c(T)$. Theorem 28 therefore implies that, for any $r > 0$, we have

$$P_{Y^n|X^n=x^n}\left(\ln\frac{dP_{Y^n|X^n=x^n}}{dP^{(C)}_{Y^n}}(Y^n) \ge D\left(P_{Y^n|X^n=x^n}\|P^{(C)}_{Y^n}\right) + r\right) \le \exp\left(-\frac{2r^2}{n c^2(T)}\right). \qquad (3.307)$$

(In fact, the above derivation goes through for any possible output distribution $P_{Y^n}$, not necessarily one induced by a code.) This is where we have departed from the original proof by Polyanskiy and Verdú [173]: we have used McDiarmid's (or bounded differences) inequality to control the deviation probability for the "conditional" information density $h_{x^n}$ directly, whereas they bounded the variance of $h_{x^n}$ using a suitable Poincaré inequality, and then derived a bound on the deviation probability using Chebyshev's inequality. As we will see shortly, the sharp concentration inequality (3.307) allows us to explicitly identify the dependence of the constant multiplying $\sqrt{n}$ in (3.304) on the channel $T$ and on the maximal error probability $\varepsilon$.

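We are now in a position to apply Augustin's strong converse. To that end, we let $\mathcal{U} = \mathcal{X}^n$, $\mathcal{V} = \mathcal{Y}^n$, and consider the DMC $S = T^n$ together with an $(n, M, \varepsilon)$-code $(f,g) = (f_n,g_n)$. Furthermore, let

$$\zeta_n = \zeta_n(\varepsilon) \triangleq c(T)\sqrt{\frac{n}{2}\ln\frac{1}{1 - 2\varepsilon}} \qquad (3.308)$$

and take $\gamma(x^n) = D\left(P_{Y^n|X^n=x^n}\|P^{(C)}_{Y^n}\right) + \zeta_n$. Using (3.301) with the auxiliary distribution $Q_V = P^{(C)}_{Y^n}$, we get

$$M \le \frac{\exp\{\mathbb{E}[\gamma(X^n)]\}}{\displaystyle \inf_{x^n \in \mathcal{X}^n} P_{Y^n|X^n=x^n}\left(\ln\frac{dP_{Y^n|X^n=x^n}}{dP^{(C)}_{Y^n}} < \gamma(x^n)\right) - \varepsilon} \qquad (3.309)$$

where

$$\mathbb{E}[\gamma(X^n)] = D\left(P_{Y^n|X^n}\|P^{(C)}_{Y^n} \,\big|\, P^{(C)}_{X^n}\right) + \zeta_n. \qquad (3.310)$$

The concentration inequality in (3.307) with $\zeta_n$ in (3.308) therefore gives that, for every $x^n \in \mathcal{X}^n$,

$$P_{Y^n|X^n=x^n}\left(\ln\frac{dP_{Y^n|X^n=x^n}}{dP^{(C)}_{Y^n}} \ge \gamma(x^n)\right) \le \exp\left(-\frac{2\zeta_n^2}{n c^2(T)}\right) = 1 - 2\varepsilon,$$

which implies that

$$\inf_{x^n \in \mathcal{X}^n} P_{Y^n|X^n=x^n}\left(\ln\frac{dP_{Y^n|X^n=x^n}}{dP^{(C)}_{Y^n}} < \gamma(x^n)\right) \ge 2\varepsilon.$$

Hence, from (3.309), (3.310) and the last inequality, it follows that

$$M \le \frac{1}{\varepsilon}\exp\left(D\left(P_{Y^n|X^n}\|P^{(C)}_{Y^n} \,\big|\, P^{(C)}_{X^n}\right) + \zeta_n\right)$$

so, by taking logarithms on both sides of the last inequality and rearranging terms, we get from (3.308) that

$$D\left(P_{Y^n|X^n}\|P^{(C)}_{Y^n} \,\big|\, P^{(C)}_{X^n}\right) \ge \ln M + \ln\varepsilon - \zeta_n = \ln M + \ln\varepsilon - c(T)\sqrt{\frac{n}{2}\ln\frac{1}{1 - 2\varepsilon}}. \qquad (3.311)$$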
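The role of the particular choice of $\zeta_n$ in (3.308) is easy to verify numerically: it makes the bounded-differences tail bound equal to $1 - 2\varepsilon$, so the denominator in Augustin's bound is at least $\varepsilon$. The following is a minimal sketch of that arithmetic; the function names and the numerical values in the usage note are illustrative only.

```python
import math

def zeta(n, eps, c_t):
    """Deviation level zeta_n(eps) = c(T) * sqrt((n/2) * ln(1 / (1 - 2 eps)))."""
    return c_t * math.sqrt(0.5 * n * math.log(1.0 / (1.0 - 2.0 * eps)))

def bounded_differences_tail(r, n, c_t):
    """Tail bound exp(-2 r^2 / (n c(T)^2)) for deviations of size r."""
    return math.exp(-2.0 * r ** 2 / (n * c_t ** 2))

def augustin_denominator_lower_bound(n, eps, c_t):
    """Lower bound (1 - tail) - eps on the denominator of Augustin's bound."""
    return (1.0 - bounded_differences_tail(zeta(n, eps, c_t), n, c_t)) - eps
```

For example, with `n = 500`, `eps = 0.1` and `c_t = 4.39`, the tail evaluates to $0.8 = 1 - 2\varepsilon$ and the denominator bound to $0.1 = \varepsilon$, independently of `n` and `c_t`.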

We are now ready to derive (|3.304|) : 

D [Pyn 1 1 Pyn ) 

= D {Pyn\ X n || Pyn \P X n ) — D(pyn\ X n 

\\P^\P ( X 1) (3.312) 

1 Fn I 

< nC- InM + In- + c{T)J- In- — — (3.313) 

where (|3.312|) uses the chain rule for divergence, while (|3.313p uses (|3.298p and (|3.31ip . This completes 
the proof of Theorem [50j □ 
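
The choice of $\zeta_n$ in (3.308) is exactly the deviation level at which the McDiarmid tail in (3.307) equals $1 - 2\varepsilon$. The following minimal sketch checks this numerically; the function names are illustrative, and the channel constant $c(T)$ is passed in as a plain positive number (the value 3.0 used in the usage note is an arbitrary assumption, not derived from any particular channel).

```python
import math


def zeta_n(n: int, eps: float, c_T: float) -> float:
    """Deviation level from (3.308): c(T) * sqrt((n/2) * ln(1/(1 - 2*eps)))."""
    if not 0.0 < eps < 0.5:
        raise ValueError("eps must lie in (0, 1/2)")
    return c_T * math.sqrt(0.5 * n * math.log(1.0 / (1.0 - 2.0 * eps)))


def mcdiarmid_tail(n: int, r: float, c_T: float) -> float:
    """Right-hand side of (3.307): exp(-2 r^2 / (n c(T)^2))."""
    return math.exp(-2.0 * r * r / (n * c_T * c_T))


def second_order_term(n: int, eps: float, c_T: float) -> float:
    """Terms added to nC - ln M in (3.304): ln(1/eps) + zeta_n."""
    return math.log(1.0 / eps) + zeta_n(n, eps, c_T)
```

For example, `mcdiarmid_tail(1000, zeta_n(1000, 0.1, 3.0), 3.0)` evaluates to $1 - 2 \cdot 0.1 = 0.8$ up to rounding, and `second_order_term` grows like $\sqrt{n}$ for fixed $\varepsilon$.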
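For an arbitrary DMC $T$ with nonzero capacity and zeroes in its transition matrix, we have the following result, which forms a sharpened version of the bound in [173, Theorem 6]:

Theorem 51. Let $T : \mathcal{X} \to \mathcal{Y}$ be a DMC with $C > 0$. Then, for any $0 < \varepsilon < 1$, any $(n, M, \varepsilon)$-code $\mathcal{C}$ for $T$ satisfies

\[
D\big(P^{(\mathcal{C})}_{Y^n} \,\big\|\, P^*_{Y^n}\big) \le nC - \ln M + O\big(\sqrt{n}\, (\ln n)^{3/2}\big).
\]

More precisely, for any such code we have

\begin{align}
D\big(P^{(\mathcal{C})}_{Y^n} \,\big\|\, P^*_{Y^n}\big) \le{} & nC - \ln M + \sqrt{2n}\, (\ln n)^{3/2} \left( 1 + \sqrt{\frac{1}{\ln n} \ln \frac{1}{1 - \varepsilon}} \right) \left( 1 + \frac{\ln |\mathcal{Y}|}{\ln n} \right) \notag \\
& + 3 \ln n + \ln\big(2 |\mathcal{X}| |\mathcal{Y}|^2\big). \tag{3.314}
\end{align}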



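Proof. Given an $(n, M, \varepsilon)$-code $\mathcal{C} = (f_n, g_n)$, let $c_1, \ldots, c_M \in \mathcal{X}^n$ be its codewords, and let $\widetilde{D}_1, \ldots, \widetilde{D}_M \subset \mathcal{Y}^n$ be the corresponding decoding regions:

\[
\widetilde{D}_i = g_n^{-1}(i) \equiv \{ y^n \in \mathcal{Y}^n : g_n(y^n) = i \}, \qquad i = 1, \ldots, M.
\]

If we choose

\[
\delta_n = \delta_n(\varepsilon) = \frac{1}{n} \left\lceil n \left( \sqrt{\frac{\ln n}{2n}} + \sqrt{\frac{1}{2n} \ln \frac{1}{1 - \varepsilon}} \right) \right\rceil \tag{3.315}
\]

(note that $n\delta_n$ is an integer), then by Lemma 17 the "blown-up" decoding regions $D_i \triangleq [\widetilde{D}_i]_{n\delta_n}$ satisfy

\[
P_{Y^n|X^n=c_i}(D_i^c) \le \exp\left[ -2n \left( \delta_n - \sqrt{\frac{1}{2n} \ln \frac{1}{1 - \varepsilon}} \right)^2 \right] \le \frac{1}{n}, \qquad \forall\, i \in \{1, \ldots, M\}. \tag{3.316}
\]

We now complete the proof by a random coding argument. For

\[
N \triangleq \left\lceil \frac{M}{n \binom{n}{n\delta_n} |\mathcal{Y}|^{n\delta_n}} \right\rceil, \tag{3.317}
\]

let $U_1, \ldots, U_N$ be independent random variables, each uniformly distributed on the set $\{1, \ldots, M\}$. For each realization $V = U^N$, let $P_{X^n(V)} \in \mathcal{P}(\mathcal{X}^n)$ denote the induced distribution of $X^n(V) = f_n(c_J)$, where $J$ is uniformly distributed on the set $\{U_1, \ldots, U_N\}$, and let $P_{Y^n(V)}$ denote the corresponding output distribution of $Y^n(V)$:

\[
P_{Y^n(V)} = \frac{1}{N} \sum_{i=1}^{N} P_{Y^n | X^n = f_n(c_{U_i})}. \tag{3.318}
\]

It is easy to show that $\mathbb{E}\big[P_{Y^n(V)}\big] = P^{(\mathcal{C})}_{Y^n}$, the output distribution of the original code $\mathcal{C}$, where the expectation is w.r.t. the distribution of $V = U^N$. Now, for $V = U^N$ and for every $y^n \in \mathcal{Y}^n$, let $\mathcal{N}_V(y^n)$ denote the list of all those indices in $\{U_1, \ldots, U_N\}$ such that $y^n \in D_{U_j}$:

\[
\mathcal{N}_V(y^n) \triangleq \{ j : y^n \in D_{U_j} \}.
\]

Consider the list decoder $Y^n \mapsto \mathcal{N}_V(Y^n)$, and let $\varepsilon(V)$ denote its conditional decoding error probability: $\varepsilon(V) \triangleq \mathbb{P}\big(J \notin \mathcal{N}_V(Y^n) \,\big|\, V\big)$. Then, for each realization of $V$, we have

\begin{align}
D\big(P_{Y^n(V)} \,\big\|\, P^*_{Y^n}\big)
&= D\big(P_{Y^n(V)|X^n(V)} \,\big\|\, P^*_{Y^n} \,\big|\, P_{X^n(V)}\big) - I\big(X^n(V); Y^n(V)\big) \tag{3.319} \\
&\le nC - I\big(X^n(V); Y^n(V)\big) \tag{3.320} \\
&\le nC - I\big(J; Y^n(V)\big) \tag{3.321} \\
&= nC - H(J) + H\big(J \,\big|\, Y^n(V)\big) \notag \\
&\le nC - \ln N + (1 - \varepsilon(V))\, \mathbb{E}\big[\ln |\mathcal{N}_V(Y^n)|\big] + n \varepsilon(V) \ln |\mathcal{X}| + \ln 2 \tag{3.322}
\end{align}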

where:

(3.319) is by the chain rule for divergence;

(3.320) is by (3.298);

(3.321) is by the data processing inequality and the fact that $J \to X^n(V) \to Y^n(V)$ is a Markov chain; and

(3.322) is by Fano's inequality for list decoding (see Appendix 3.C), and also since (i) $N \le |\mathcal{X}|^n$, (ii) $J$ is uniformly distributed on $\{U_1, \ldots, U_N\}$, so $H(J | U_1, \ldots, U_N) = \ln N$ and $H(J) \ge \ln N$.

(Note that all the quantities indexed by $V$ in the above chain of estimates are actually random variables, since they depend on the realization $V = U^N$.) Now, from (3.317), it follows that

\begin{align}
\ln N &\ge \ln M - \ln n - \ln \binom{n}{n\delta_n} - n\delta_n \ln |\mathcal{Y}| \notag \\
&\ge \ln M - \ln n - n\delta_n \big( \ln n + \ln |\mathcal{Y}| \big) \tag{3.323}
\end{align}

where the last inequality uses the simple inequality $\binom{n}{k} \le n^k$ for $k \le n$ with $k = n\delta_n$ (it is noted that the gain in using instead the inequality $\binom{n}{n\delta_n} \le \exp\big(n\, h(\delta_n)\big)$ is marginal, and it does not have any advantage asymptotically for large $n$). Moreover, each $y^n \in \mathcal{Y}^n$ can belong to at most $\binom{n}{n\delta_n} |\mathcal{Y}|^{n\delta_n}$ blown-up decoding sets, so

\begin{align}
\ln \big|\mathcal{N}_V(Y^n = y^n)\big| &\le \ln \binom{n}{n\delta_n} + n\delta_n \ln |\mathcal{Y}| \notag \\
&\le n\delta_n \big( \ln n + \ln |\mathcal{Y}| \big), \qquad \forall\, y^n \in \mathcal{Y}^n. \tag{3.324}
\end{align}

Substituting (3.323) and (3.324) into (3.322), we get

\[
D\big(P_{Y^n(V)} \,\big\|\, P^*_{Y^n}\big) \le nC - \ln M + \ln n + 2n\delta_n \big( \ln n + \ln |\mathcal{Y}| \big) + n \varepsilon(V) \ln |\mathcal{X}| + \ln 2. \tag{3.325}
\]

Using the fact that $\mathbb{E}\big[P_{Y^n(V)}\big] = P^{(\mathcal{C})}_{Y^n}$, convexity of the relative entropy, and (3.325), we get

\[
D\big(P^{(\mathcal{C})}_{Y^n} \,\big\|\, P^*_{Y^n}\big) \le nC - \ln M + \ln n + 2n\delta_n \big( \ln n + \ln |\mathcal{Y}| \big) + n\, \mathbb{E}[\varepsilon(V)] \ln |\mathcal{X}| + \ln 2. \tag{3.326}
\]

To finish the proof and get (3.314), we use the fact that

\[
\mathbb{E}[\varepsilon(V)] \le \max_{1 \le i \le M} P_{Y^n|X^n=c_i}(D_i^c) \le \frac{1}{n},
\]

which follows from (3.316), as well as the substitution of (3.315) in (3.326) (note that, from (3.315), it follows that $\delta_n < \sqrt{\frac{\ln n}{2n}} + \sqrt{\frac{1}{2n} \ln \frac{1}{1 - \varepsilon}} + \frac{1}{n}$). This completes the proof of Theorem 51. □
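
The two numerical facts about $\delta_n$ that drive the proof, namely that the exponential in (3.316) is at most $1/n$ and that $\delta_n$ exceeds its unrounded value by less than $1/n$, can be checked directly. The sketch below is only an illustration under the reading of (3.315) as a ceiling rounded to a multiple of $1/n$; the function names are hypothetical.

```python
import math


def unrounded_delta(n: int, eps: float) -> float:
    """sqrt(ln n / (2n)) + sqrt(ln(1/(1-eps)) / (2n)), the quantity rounded up in (3.315)."""
    return (math.sqrt(math.log(n) / (2.0 * n))
            + math.sqrt(math.log(1.0 / (1.0 - eps)) / (2.0 * n)))


def delta_n(n: int, eps: float) -> float:
    """delta_n from (3.315): the unrounded value rounded up to a multiple of 1/n."""
    return math.ceil(n * unrounded_delta(n, eps)) / n


def blowup_error_bound(n: int, eps: float) -> float:
    """Middle expression in (3.316): exp(-2n (delta_n - sqrt(ln(1/(1-eps)) / (2n)))^2)."""
    shift = math.sqrt(math.log(1.0 / (1.0 - eps)) / (2.0 * n))
    return math.exp(-2.0 * n * (delta_n(n, eps) - shift) ** 2)


def log_list_size_bound(n: int, eps: float, output_alphabet_size: int) -> float:
    """Right-hand side of (3.324): n*delta_n*(ln n + ln|Y|)."""
    return n * delta_n(n, eps) * (math.log(n) + math.log(output_alphabet_size))
```

Since $\binom{n}{k} \le n^k$, the value returned by `log_list_size_bound` dominates $\ln \binom{n}{n\delta_n} + n\delta_n \ln|\mathcal{Y}|$, which is the first line of (3.324).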
We are now ready to examine some consequences of Theorems 50 and 51. To start with, consider a sequence $\{\mathcal{C}_n\}_{n=1}^{\infty}$, where each $\mathcal{C}_n = (f_n, g_n)$ is an $(n, M_n, \varepsilon)$-code for a DMC $T : \mathcal{X} \to \mathcal{Y}$ with $C > 0$. We say that $\{\mathcal{C}_n\}_{n=1}^{\infty}$ is capacity-achieving if

\[
\lim_{n \to \infty} \frac{1}{n} \ln M_n = C. \tag{3.327}
\]

Then, from Theorems 50 and 51, it follows that any such sequence satisfies

\[
\lim_{n \to \infty} \frac{1}{n} D\big(P^{(\mathcal{C}_n)}_{Y^n} \,\big\|\, P^*_{Y^n}\big) = 0. \tag{3.328}
\]

Moreover, as shown in [173], if the restriction to either deterministic encoding maps or to the maximal probability of error criterion is lifted, then the convergence in (3.328) may no longer hold. This is in sharp contrast to [172, Theorem 2], which states that (3.328) holds for any capacity-achieving sequence of codes with vanishing probability of error (maximal or average).

Another remarkable fact that follows from the above theorems is that a broad class of functions evaluated on the output of a good code concentrate sharply around their expectations with respect to the capacity-achieving output distribution. Specifically, we have the following version of [173, Proposition 10] (again, we have streamlined the statement and the proof a bit to relate them to earlier material in this chapter):

Theorem 52. Let $T : \mathcal{X} \to \mathcal{Y}$ be a DMC with $C > 0$ and $C_1 < \infty$. Let $d : \mathcal{Y}^n \times \mathcal{Y}^n \to \mathbb{R}_+$ be a metric, and suppose that there exists a constant $c > 0$, such that the conditional probability distributions $P_{Y^n|X^n=x^n}$, $x^n \in \mathcal{X}^n$, as well as $P^*_{Y^n}$ satisfy $T_1(c)$ on the metric space $(\mathcal{Y}^n, d)$. Then, for any $\varepsilon \in (0, 1/2)$, there exists a constant $a > 0$ that depends only on $T$ and on $\varepsilon$ (to be defined explicitly in the following), such that for any $(n, M, \varepsilon)$-code $\mathcal{C}$ for $T$ and any function $f : \mathcal{Y}^n \to \mathbb{R}$ we have

\[
P^{(\mathcal{C})}_{Y^n}\Big( \big| f(Y^n) - \mathbb{E}[f(Y^{*n})] \big| \ge r \Big) \le \frac{4}{\varepsilon} \exp\left( nC - \ln M + a\sqrt{n} - \frac{r^2}{8c \|f\|^2_{\mathrm{Lip}}} \right), \qquad \forall\, r \ge 0 \tag{3.329}
\]

where $\mathbb{E}[f(Y^{*n})]$ designates the expected value of $f(Y^n)$ w.r.t. the capacity-achieving output distribution $P^*_{Y^n}$,

\[
\|f\|_{\mathrm{Lip}} \triangleq \sup_{y^n \neq v^n} \frac{|f(y^n) - f(v^n)|}{d(y^n, v^n)}
\]

is the Lipschitz constant of $f$ w.r.t. the metric $d$, and

\[
a \triangleq c(T) \sqrt{\frac{1}{2} \ln \frac{1}{1 - 2\varepsilon}} \tag{3.330}
\]

with $c(T)$ in (3.303).

Remark 51. Our sharpening of the corresponding result from [173, Proposition 10] consists mainly in identifying an explicit form for the constant in front of $\sqrt{n}$ in the bound (3.329); this provides a closed-form expression for the concentration of measure inequality.

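Proof. For any $f$, define

\[
\mu^*_f \triangleq \mathbb{E}[f(Y^{*n})], \qquad \phi(x^n) \triangleq \mathbb{E}[f(Y^n) \,|\, X^n = x^n], \quad \forall\, x^n \in \mathcal{X}^n. \tag{3.331}
\]

Since each $P_{Y^n|X^n=x^n}$ satisfies $T_1(c)$, by the Bobkov-Götze theorem (Theorem 36), we have

\[
\mathbb{P}\Big( \big| f(Y^n) - \phi(x^n) \big| \ge r \,\Big|\, X^n = x^n \Big) \le 2 \exp\left( -\frac{r^2}{2c \|f\|^2_{\mathrm{Lip}}} \right), \qquad \forall\, r \ge 0. \tag{3.332}
\]

Now, given $\mathcal{C}$, consider a subcode $\mathcal{C}'$ with codewords $x^n \in \mathcal{X}^n$ satisfying $\phi(x^n) \ge \mu^*_f + r$ for $r \ge 0$. The number of codewords $M'$ of $\mathcal{C}'$ satisfies

\[
M' = M\, P^{(\mathcal{C})}_{X^n}\big( \phi(X^n) \ge \mu^*_f + r \big). \tag{3.333}
\]

Let $Q = P^{(\mathcal{C}')}_{Y^n}$ be the output distribution induced by $\mathcal{C}'$. Then

\begin{align}
\mu^*_f + r &\le \frac{1}{M'} \sum_{x^n \in \text{codewords}(\mathcal{C}')} \phi(x^n) \tag{3.334} \\
&= \mathbb{E}_Q[f(Y^n)] \tag{3.335} \\
&\le \mathbb{E}[f(Y^{*n})] + \|f\|_{\mathrm{Lip}} \sqrt{2c\, D\big(Q_{Y^n} \,\big\|\, P^*_{Y^n}\big)} \tag{3.336} \\
&\le \mu^*_f + \|f\|_{\mathrm{Lip}} \sqrt{2c \left( nC - \ln M' + a\sqrt{n} + \ln \frac{1}{\varepsilon} \right)}, \tag{3.337}
\end{align}

where:

(3.334) is by definition of $\mathcal{C}'$;

(3.335) is by definition of $\phi$ in (3.331);

(3.336) follows from the fact that $P^*_{Y^n}$ satisfies $T_1(c)$ and from the Kantorovich-Rubinstein formula (3.220); and

(3.337) holds for the constant $a = a(T, \varepsilon) > 0$ in (3.330) due to Theorem 50 (see (3.304)) and because $\mathcal{C}'$ is an $(n, M', \varepsilon)$-code for $T$.

From this and (3.333), we get

\[
r \le \|f\|_{\mathrm{Lip}} \sqrt{2c \left( nC - \ln M - \ln P^{(\mathcal{C})}_{X^n}\big( \phi(X^n) \ge \mu^*_f + r \big) + a\sqrt{n} + \ln \frac{1}{\varepsilon} \right)}
\]

so, it follows that

\[
P^{(\mathcal{C})}_{X^n}\big( \phi(X^n) \ge \mu^*_f + r \big) \le \exp\left( nC - \ln M + a\sqrt{n} + \ln \frac{1}{\varepsilon} - \frac{r^2}{2c \|f\|^2_{\mathrm{Lip}}} \right).
\]

Following the same line of reasoning with $-f$ instead of $f$, we conclude that

\[
P^{(\mathcal{C})}_{X^n}\Big( \big| \phi(X^n) - \mu^*_f \big| \ge r \Big) \le 2 \exp\left( nC - \ln M + a\sqrt{n} + \ln \frac{1}{\varepsilon} - \frac{r^2}{2c \|f\|^2_{\mathrm{Lip}}} \right). \tag{3.338}
\]

Finally, for every $r \ge 0$,

\begin{align}
& P^{(\mathcal{C})}_{Y^n}\Big( \big| f(Y^n) - \mu^*_f \big| \ge r \Big) \notag \\
&\quad \le P^{(\mathcal{C})}_{X^n, Y^n}\Big( \big| f(Y^n) - \phi(X^n) \big| \ge r/2 \Big) + P^{(\mathcal{C})}_{X^n}\Big( \big| \phi(X^n) - \mu^*_f \big| \ge r/2 \Big) \notag \\
&\quad \le 2 \exp\left( -\frac{r^2}{8c \|f\|^2_{\mathrm{Lip}}} \right) + 2 \exp\left( nC - \ln M + a\sqrt{n} + \ln \frac{1}{\varepsilon} - \frac{r^2}{8c \|f\|^2_{\mathrm{Lip}}} \right) \tag{3.339} \\
&\quad \le 4 \exp\left( nC - \ln M + a\sqrt{n} + \ln \frac{1}{\varepsilon} - \frac{r^2}{8c \|f\|^2_{\mathrm{Lip}}} \right), \tag{3.340}
\end{align}

where (3.339) is by (3.332) and (3.338), while (3.340) follows from the fact that

\[
nC - \ln M + a\sqrt{n} + \ln \frac{1}{\varepsilon} \ge D\big(P^{(\mathcal{C})}_{Y^n} \,\big\|\, P^*_{Y^n}\big) \ge 0
\]

by Theorem 50 and from (3.330). This proves (3.329). □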
As an illustration, let us consider $\mathcal{Y}^n$ with the product metric

\[
d(y^n, v^n) = \sum_{i=1}^{n} 1_{\{y_i \neq v_i\}} \tag{3.341}
\]

(this is the metric $d_{1,n}$ induced by the Hamming metric on $\mathcal{Y}$). Then, any function $f : \mathcal{Y}^n \to \mathbb{R}$ of the form

\[
f(y^n) = \frac{1}{n} \sum_{i=1}^{n} f_i(y_i), \qquad \forall\, y^n \in \mathcal{Y}^n \tag{3.342}
\]

where $f_1, \ldots, f_n : \mathcal{Y} \to \mathbb{R}$ are Lipschitz functions on $\mathcal{Y}$, will satisfy

\[
\|f\|_{\mathrm{Lip}} \le \frac{L}{n}, \qquad L \triangleq \max_{1 \le i \le n} \|f_i\|_{\mathrm{Lip}}.
\]

Any probability distribution $P$ on $\mathcal{Y}$ equipped with the Hamming metric satisfies $T_1(1/4)$ (this is simply Pinsker's inequality); by Proposition 11, any product probability distribution on $\mathcal{Y}^n$ satisfies $T_1(n/4)$ w.r.t. the product metric (3.341). Consequently, for any $(n, M, \varepsilon)$-code for $T$ and any function $f : \mathcal{Y}^n \to \mathbb{R}$ of the form (3.342), Theorem 52 gives the concentration inequality

\[
P^{(\mathcal{C})}_{Y^n}\Big( \big| f(Y^n) - \mathbb{E}[f(Y^{*n})] \big| \ge r \Big) \le \frac{4}{\varepsilon} \exp\left( nC - \ln M + a\sqrt{n} - \frac{n r^2}{2L^2} \right), \qquad \forall\, r \ge 0. \tag{3.343}
\]

Concentration inequalities like (3.329), or its more specialized version (3.343), can be very useful in characterizing various performance characteristics of good channel codes without having to explicitly construct such codes: all one needs to do is to find the capacity-achieving output distribution $P^*_Y$ and evaluate $\mathbb{E}[f(Y^{*n})]$ for any $f$ of interest. Then, Theorem 52 guarantees that $f(Y^n)$ concentrates tightly around $\mathbb{E}[f(Y^{*n})]$, which is relatively easy to compute since $P^*_{Y^n}$ is a product distribution.
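
The statement that every distribution on a finite set with the Hamming metric satisfies $T_1(1/4)$ amounts to Pinsker's inequality, because under that metric the $L^1$ Wasserstein distance coincides with the total variation distance, so $T_1(1/4)$ reads $\|Q - P\|_{\mathrm{TV}} \le \sqrt{D(Q\|P)/2}$. The following sketch probes this on randomly drawn distributions; the helper names and the sampling scheme are illustrative assumptions, and a random check of this kind is of course not a proof.

```python
import math
import random


def total_variation(p, q):
    """Total variation distance, equal to W_1 under the Hamming metric."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(q, p):
    """D(q || p) in nats; assumes p[i] > 0 wherever q[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)


def random_distribution(k, rng):
    """Strictly positive random distribution on k points."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


def pinsker_gap(p, q):
    """sqrt(D(q||p)/2) - TV(p, q), which Pinsker's inequality makes nonnegative."""
    return math.sqrt(kl_divergence(q, p) / 2.0) - total_variation(p, q)
```

Calling `pinsker_gap` on pairs produced by `random_distribution` should return a nonnegative number every time, matching the $T_1(1/4)$ claim for a single letter.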

Remark 52. This sub-section considers the empirical output distributions of good channel codes with non-vanishing probability of error via the use of concentration inequalities. As a concluding remark, it is noted that the combined result in [178, Eqs. (A17), (A19)] provides a lower bound on the rate loss with respect to fully random block codes (with a binomial distribution) in terms of the normalized divergence between the distance spectrum of the considered code and the binomial distribution. This result refers to the empirical input distribution of good codes, and it was derived via the use of variations on the Gallager bounds.



3.6.3 An information-theoretic converse for concentration of measure 

If we were to summarize the main idea behind concentration of measure, it would be this: if a subset of a 
metric probability space does not have a "too small" probability mass, then its isoperimetric enlargements 
(or blowups) will eventually take up most of the probability mass. On the other hand, it makes sense to 
ask whether a converse of this statement is true — given a set whose blowups eventually take up most of 
the probability mass, how small can this set be? This question was answered precisely by Kontoyiannis [179] using information-theoretic techniques.

The following setting is considered in [179]: Let $\mathcal{X}$ be a finite set, together with a nonnegative distortion function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ (which is not necessarily a metric) and a strictly positive mass function $M : \mathcal{X} \to (0, \infty)$ (which is not necessarily normalized to one). As before, let us extend the "single-letter" distortion $d$ to $d_n : \mathcal{X}^n \times \mathcal{X}^n \to \mathbb{R}_+$, $n \in \mathbb{N}$, where

\[
d_n(x^n, y^n) \triangleq \sum_{i=1}^{n} d(x_i, y_i), \qquad \forall\, x^n, y^n \in \mathcal{X}^n.
\]

For every $n \in \mathbb{N}$ and for every set $C \subseteq \mathcal{X}^n$, let us define

\[
M^n(C) \triangleq \sum_{x^n \in C} M^n(x^n)
\]

where

\[
M^n(x^n) \triangleq \prod_{i=1}^{n} M(x_i), \qquad \forall\, x^n \in \mathcal{X}^n.
\]

As before, we define the $r$-blowup of any set $A \subseteq \mathcal{X}^n$ by

\[
A_r \triangleq \{ x^n \in \mathcal{X}^n : d_n(x^n, A) \le r \},
\]

where $d_n(x^n, A) \triangleq \min_{y^n \in A} d_n(x^n, y^n)$. Fix a probability distribution $P \in \mathcal{P}(\mathcal{X})$, where we assume without loss of generality that $P$ is strictly positive. We are interested in the following question: Given a sequence of sets $A^{(n)} \subseteq \mathcal{X}^n$, $n \in \mathbb{N}$, such that

\[
P^{\otimes n}\big( A^{(n)}_{n\delta} \big) \to 1, \qquad \text{as } n \to \infty
\]

for some $\delta \ge 0$, how small can their masses $M^n(A^{(n)})$ be?

In order to state and prove the main result of [179] that answers this question, we need a few preliminary definitions. For any $n \in \mathbb{N}$, any pair $P_n, Q_n$ of probability measures on $\mathcal{X}^n$, and any $\delta \ge 0$, let us define the set

\[
\Pi_n(P_n, Q_n, \delta) \triangleq \left\{ \pi_n \in \Pi_n(P_n, Q_n) : \frac{1}{n} \mathbb{E}_{\pi_n}\big[ d_n(X^n, Y^n) \big] \le \delta \right\} \tag{3.344}
\]

of all couplings $\pi_n \in \mathcal{P}(\mathcal{X}^n \times \mathcal{X}^n)$ of $P_n$ and $Q_n$, such that the per-letter expected distortion between $X^n$ and $Y^n$ with $(X^n, Y^n) \sim \pi_n$ is at most $\delta$. With this, we define

\[
I_n(P_n, Q_n, \delta) \triangleq \inf_{\pi_n \in \Pi_n(P_n, Q_n, \delta)} D(\pi_n \| P_n \otimes Q_n),
\]

and consider the following rate function:

\begin{align*}
R_n(\delta) \equiv R_n(\delta; P_n, M^n)
&\triangleq \inf_{Q_n \in \mathcal{P}(\mathcal{X}^n)} \Big\{ I_n(P_n, Q_n, \delta) + \mathbb{E}_{Q_n}\big[ \ln M^n(Y^n) \big] \Big\} \\
&\equiv \inf_{P_{X^n Y^n}} \left\{ I(X^n; Y^n) + \mathbb{E}\big[ \ln M^n(Y^n) \big] : P_{X^n} = P_n,\ \frac{1}{n} \mathbb{E}\big[ d_n(X^n, Y^n) \big] \le \delta \right\}.
\end{align*}

When $n = 1$, we will simply write $\Pi(P, Q, \delta)$, $I(P, Q, \delta)$ and $R(\delta)$. For the special case when each $P_n$ is the product measure $P^{\otimes n}$, we have

\[
R(\delta) = \lim_{n \to \infty} \frac{1}{n} R_n(\delta) = \inf_{n \ge 1} \frac{1}{n} R_n(\delta) \tag{3.345}
\]

(see [179, Lemma 2]). We are now ready to state the main result of [179]:
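Theorem 53. Consider an arbitrary set $A^{(n)} \subseteq \mathcal{X}^n$, and denote

\[
\delta \triangleq \frac{1}{n} \mathbb{E}\big[ d_n(X^n, A^{(n)}) \big].
\]

Then

\[
\frac{1}{n} \ln M^n(A^{(n)}) \ge R(\delta; P, M). \tag{3.346}
\]

Proof. Given $A_n \subseteq \mathcal{X}^n$, let $\varphi_n : \mathcal{X}^n \to A_n$ be the function that maps each $x^n \in \mathcal{X}^n$ to the closest element $y^n \in A_n$, i.e.,

\[
d_n(x^n, \varphi_n(x^n)) = d_n(x^n, A_n)
\]

(we assume some fixed rule for resolving ties). If $X^n \sim P^{\otimes n}$, then let $Q_n \in \mathcal{P}(\mathcal{X}^n)$ denote the distribution of $Y^n = \varphi_n(X^n)$, and let $\pi_n \in \mathcal{P}(\mathcal{X}^n \times \mathcal{X}^n)$ denote the joint distribution of $X^n$ and $Y^n$:

\[
\pi_n(x^n, y^n) = P^{\otimes n}(x^n)\, 1_{\{y^n = \varphi_n(x^n)\}}.
\]

Then, the two marginals of $\pi_n$ are $P^{\otimes n}$ and $Q_n$, and

\begin{align*}
\mathbb{E}_{\pi_n}\big[ d_n(X^n, Y^n) \big] &= \mathbb{E}_{\pi_n}\big[ d_n(X^n, \varphi_n(X^n)) \big] \\
&= \mathbb{E}_{\pi_n}\big[ d_n(X^n, A_n) \big] \\
&= n\delta,
\end{align*}

so $\pi_n \in \Pi_n(P^{\otimes n}, Q_n, \delta)$. Moreover,

\begin{align}
\ln M^n(A_n) &= \ln \sum_{y^n \in A_n} M^n(y^n) \notag \\
&= \ln \sum_{y^n \in A_n} Q_n(y^n) \frac{M^n(y^n)}{Q_n(y^n)} \notag \\
&\ge \sum_{y^n \in A_n} Q_n(y^n) \ln \frac{M^n(y^n)}{Q_n(y^n)} \tag{3.347} \\
&= \sum_{x^n \in \mathcal{X}^n,\, y^n \in A_n} \pi_n(x^n, y^n) \ln \frac{\pi_n(x^n, y^n)}{P^{\otimes n}(x^n) Q_n(y^n)} + \sum_{y^n \in A_n} Q_n(y^n) \ln M^n(y^n) \tag{3.348} \\
&= I(X^n; Y^n) + \mathbb{E}_{Q_n}\big[ \ln M^n(Y^n) \big] \tag{3.349} \\
&\ge R_n(\delta), \tag{3.350}
\end{align}

where (3.347) is by Jensen's inequality, (3.348) and (3.349) use the fact that $\pi_n$ is a coupling of $P^{\otimes n}$ and $Q_n$, and (3.350) is by definition of $R_n(\delta)$. Using (3.345), we get (3.346), and the theorem is proved. □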

Remark 53. In the same paper [179], an achievability result was also proved: For any $\delta \ge 0$ and any $\varepsilon > 0$, there is a sequence of sets $A^{(n)} \subseteq \mathcal{X}^n$ such that

\[
\frac{1}{n} \ln M^n(A^{(n)}) \le R(\delta) + \varepsilon, \qquad \forall\, n \in \mathbb{N} \tag{3.351}
\]

and

\[
\frac{1}{n} d_n(X^n, A^{(n)}) \le \delta, \qquad \text{eventually a.s.} \tag{3.352}
\]

We are now ready to use Theorem 53 to answer the question posed at the beginning of this section. Specifically, we consider the case when $M = P$. Defining the concentration exponent $R_c(r; P) \triangleq R(r; P, P)$, we have:

Corollary 11 (Converse concentration of measure). If $A^{(n)} \subseteq \mathcal{X}^n$ is an arbitrary set, then

\[
\frac{1}{n} \ln P^{\otimes n}\big( A^{(n)} \big) \ge R_c(\delta; P), \tag{3.353}
\]

where $\delta = \frac{1}{n} \mathbb{E}\big[ d_n(X^n, A^{(n)}) \big]$. Moreover, if the sequence of sets $\{A^{(n)}\}_{n=1}^{\infty}$ is such that, for some $\delta \ge 0$, $P^{\otimes n}\big( A^{(n)}_{n\delta} \big) \to 1$ as $n \to \infty$, then

\[
\liminf_{n \to \infty} \frac{1}{n} \ln P^{\otimes n}\big( A^{(n)} \big) \ge R_c(\delta; P). \tag{3.354}
\]

Remark 54. A moment of reflection shows that the concentration exponent $R_c(\delta; P)$ is nonpositive. Indeed, from definitions,

\begin{align}
R_c(\delta; P) &= R(\delta; P, P) \notag \\
&= \inf_{P_{XY}} \Big\{ I(X; Y) + \mathbb{E}[\ln P(Y)] : P_X = P,\ \mathbb{E}[d(X, Y)] \le \delta \Big\} \notag \\
&= \inf_{P_{XY}} \Big\{ H(Y) - H(Y|X) + \mathbb{E}[\ln P(Y)] : P_X = P,\ \mathbb{E}[d(X, Y)] \le \delta \Big\} \notag \\
&= \inf_{P_{XY}} \Big\{ -D(P_Y \| P) - H(Y|X) : P_X = P,\ \mathbb{E}[d(X, Y)] \le \delta \Big\} \notag \\
&= -\sup_{P_{XY}} \Big\{ D(P_Y \| P) + H(Y|X) : P_X = P,\ \mathbb{E}[d(X, Y)] \le \delta \Big\}, \tag{3.355}
\end{align}

which proves the claim, since both the divergence and the (conditional) entropy are nonnegative.
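
The algebra behind (3.355) rests on the identity $I(X;Y) + \mathbb{E}[\ln P(Y)] = -D(P_Y \| P) - H(Y|X)$, valid whenever $P_X = P$. As a minimal sketch, the identity can be evaluated on a randomly generated conditional law; the function and variable names below are illustrative, and the distortion constraint is ignored since it does not enter the identity.

```python
import math
import random


def concentration_objective(P, W):
    """Return both sides of the identity used in (3.355).

    P[x] is a strictly positive distribution on {0, ..., k-1};
    W[x][y] = P(Y = y | X = x) with Y taking values in the same set.
    """
    k = len(P)
    PY = [sum(P[x] * W[x][y] for x in range(k)) for y in range(k)]
    mutual_info = sum(
        P[x] * W[x][y] * math.log(W[x][y] / PY[y])
        for x in range(k) for y in range(k) if P[x] * W[x][y] > 0
    )
    mean_log_mass = sum(PY[y] * math.log(P[y]) for y in range(k))
    divergence = sum(PY[y] * math.log(PY[y] / P[y]) for y in range(k) if PY[y] > 0)
    cond_entropy = -sum(
        P[x] * W[x][y] * math.log(W[x][y])
        for x in range(k) for y in range(k) if W[x][y] > 0
    )
    return mutual_info + mean_log_mass, -divergence - cond_entropy


def random_distribution(k, rng):
    """Strictly positive random distribution on k points."""
    weights = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]
```

Both returned values agree up to rounding and are nonpositive, which is the content of the remark.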

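Remark 55. Using the achievability result from [179] (cf. Remark 53), one can also prove that there exists a sequence of sets $\{A^{(n)}\}_{n=1}^{\infty}$, such that

\[
\lim_{n \to \infty} P^{\otimes n}\big( A^{(n)}_{n\delta} \big) = 1 \qquad \text{and} \qquad \lim_{n \to \infty} \frac{1}{n} \ln P^{\otimes n}\big( A^{(n)} \big) \le R_c(\delta; P).
\]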

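As an illustration, let us consider the case when $\mathcal{X} = \{0, 1\}$ and $d$ is the Hamming distortion, $d(x, y) = 1_{\{x \neq y\}}$. Then $\mathcal{X}^n = \{0, 1\}^n$ is the $n$-dimensional binary cube. Let $P$ be the Bernoulli($p$) probability measure, which satisfies a $T_1\big( \frac{1}{2\varphi(p)} \big)$ transportation-cost inequality w.r.t. the $L^1$ Wasserstein distance induced by the Hamming metric, where $\varphi(p)$ is defined in (3.197). By Proposition 11, the product measure $P^{\otimes n}$ satisfies a $T_1\big( \frac{n}{2\varphi(p)} \big)$ transportation-cost inequality on the product space $(\mathcal{X}^n, d_n)$. Consequently, it follows from (3.208) that for any $\delta \ge 0$ and any $A^{(n)} \subseteq \mathcal{X}^n$,

\begin{align}
P^{\otimes n}\big( A^{(n)}_{n\delta} \big) &\ge 1 - \exp\left( -\frac{\varphi(p)}{n} \left( n\delta - \sqrt{\frac{n}{\varphi(p)} \ln \frac{1}{P^{\otimes n}(A^{(n)})}} \right)^2 \right) \notag \\
&= 1 - \exp\left( -n \varphi(p) \left( \delta - \sqrt{\frac{1}{n \varphi(p)} \ln \frac{1}{P^{\otimes n}(A^{(n)})}} \right)^2 \right). \tag{3.356}
\end{align}

Thus, if a sequence of sets $A^{(n)} \subseteq \mathcal{X}^n$, $n \in \mathbb{N}$, satisfies

\[
\liminf_{n \to \infty} \frac{1}{n} \ln P^{\otimes n}\big( A^{(n)} \big) > -\varphi(p) \delta^2, \tag{3.357}
\]

then

\[
P^{\otimes n}\big( A^{(n)}_{n\delta} \big) \to 1 \qquad \text{as } n \to \infty. \tag{3.358}
\]

The converse result, Corollary 11, says that if a sequence of sets $A^{(n)} \subseteq \mathcal{X}^n$ satisfies (3.358), then (3.354) holds. Let us compare the concentration exponent $R_c(\delta; P)$, where $P$ is the Bernoulli($p$) measure, with the exponent $-\varphi(p) \delta^2$ on the right-hand side of (3.357):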

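Theorem 54. If $P$ is the Bernoulli($p$) measure with $p \in [0, 1/2]$, then the concentration exponent $R_c(\delta; P)$ satisfies

\[
R_c(\delta; P) \le -\varphi(p) \delta^2 - (1 - p)\, h\left( \frac{\delta}{1 - p} \right), \qquad \forall\, \delta \in [0, 1 - p] \tag{3.359}
\]

and

\[
R_c(\delta; P) = \ln p, \qquad \forall\, \delta \in [1 - p, 1] \tag{3.360}
\]

where $h(x) \triangleq -x \ln x - (1 - x) \ln(1 - x)$, $x \in [0, 1]$, is the binary entropy function (in nats).

Proof. From (3.355), we have

\[
R_c(\delta; P) = -\sup_{P_{XY}} \Big\{ D(P_Y \| P) + H(Y|X) : P_X = P,\ \mathbb{P}(X \neq Y) \le \delta \Big\}. \tag{3.361}
\]

For a given $\delta \in [0, 1 - p]$, let us choose $\widetilde{P}_Y$ so that $\|\widetilde{P}_Y - P\|_{\mathrm{TV}} = \delta$. Then from (3.199),

\begin{align}
\frac{D(\widetilde{P}_Y \| P)}{\delta^2} &= \frac{D(\widetilde{P}_Y \| P)}{\|\widetilde{P}_Y - P\|^2_{\mathrm{TV}}} \notag \\
&\ge \inf_{Q} \frac{D(Q \| P)}{\|Q - P\|^2_{\mathrm{TV}}} \notag \\
&= \varphi(p). \tag{3.362}
\end{align}

By the coupling representation of the total variation distance, we can choose a joint distribution $\widetilde{P}_{XY}$ with marginals $\widetilde{P}_X = P$ and $\widetilde{P}_Y$, such that $\mathbb{P}(X \neq Y) = \|\widetilde{P}_Y - P\|_{\mathrm{TV}} = \delta$. Moreover, using (3.195), we can compute

\[
\widetilde{P}_{Y|X=0} = \mathrm{Bernoulli}\left( \frac{\delta}{1 - p} \right) \qquad \text{and} \qquad \widetilde{P}_{Y|X=1}(y) = \delta_1(y) \triangleq 1_{\{y = 1\}}.
\]

Consequently,

\[
H(Y|X) = (1 - p) H(Y | X = 0) = (1 - p)\, h\left( \frac{\delta}{1 - p} \right). \tag{3.363}
\]

From (3.361), (3.362) and (3.363), we obtain

\begin{align*}
R_c(\delta; P) &\le -D(\widetilde{P}_Y \| P) - H(Y|X) \\
&\le -\varphi(p) \delta^2 - (1 - p)\, h\left( \frac{\delta}{1 - p} \right).
\end{align*}

To prove (3.360), it suffices to consider the case where $\delta = 1 - p$. If we let $Y$ be independent of $X \sim P$, then $I(X; Y) = 0$, so we have to minimize $\mathbb{E}_Q[\ln P(Y)]$ over all distributions $Q$ of $Y$. But then

\[
\min_{Q} \mathbb{E}_Q[\ln P(Y)] = \min_{y \in \{0, 1\}} \ln P(y) = \min\{ \ln p, \ln(1 - p) \} = \ln p,
\]

where the last equality holds since $p \le 1/2$. □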
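
The coupling used in the proof is concrete enough to evaluate numerically. The sketch below assumes that $\varphi(p)$ from (3.197), which is not restated in this section, is $\varphi(p) = \frac{1}{1 - 2p} \ln \frac{1 - p}{p}$ for $p \neq 1/2$ and $\varphi(1/2) = 2$; under that assumption it compares the value $-D(\widetilde{P}_Y \| P) - H(Y|X)$ achieved by the coupling with the right-hand side of (3.359). All function names are illustrative.

```python
import math


def h(x: float) -> float:
    """Binary entropy in nats, with h(0) = h(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)


def phi(p: float) -> float:
    """Assumed form of the function in (3.197); equals 2 at p = 1/2."""
    if abs(p - 0.5) < 1e-12:
        return 2.0
    return math.log((1.0 - p) / p) / (1.0 - 2.0 * p)


def kl_bernoulli(q: float, p: float) -> float:
    """D(Bernoulli(q) || Bernoulli(p)) in nats, for 0 < p < 1 and 0 <= q <= 1."""
    q = min(max(q, 0.0), 1.0)
    first = q * math.log(q / p) if q > 0.0 else 0.0
    second = (1.0 - q) * math.log((1.0 - q) / (1.0 - p)) if q < 1.0 else 0.0
    return first + second


def coupling_value(p: float, delta: float) -> float:
    """-D(P_Y||P) - H(Y|X) when Y = 1 given X = 1 and Y ~ Bernoulli(delta/(1-p)) given X = 0."""
    q = p + delta  # P(Y = 1) under the coupling, so the TV distance to P equals delta
    return -kl_bernoulli(q, p) - (1.0 - p) * h(delta / (1.0 - p))


def theorem_54_bound(p: float, delta: float) -> float:
    """Right-hand side of (3.359)."""
    return -phi(p) * delta ** 2 - (1.0 - p) * h(delta / (1.0 - p))
```

On a grid of $\delta \in [0, 1 - p]$ the coupling value never exceeds the bound, with equality at $\delta = 1 - 2p$, where $\widetilde{P}_Y$ is Bernoulli($1 - p$).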
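3.A Van Trees inequality

Consider the problem of estimating a random variable $Y \sim P_Y$ based on a noisy observation $U = \sqrt{s}\, Y + Z$, where $s > 0$ is the SNR parameter, while the additive noise $Z \sim G$ is independent of $Y$. We assume that $P_Y$ has a differentiable, absolutely continuous density $p_Y$ with $J(Y) < \infty$. Our goal is to prove the van Trees inequality (3.61) and to establish that equality in (3.61) holds if and only if $Y$ is Gaussian.

In fact, we will prove a more general statement: Let $\varphi(U)$ be an arbitrary (Borel-measurable) estimator of $Y$. Then

\[
\mathbb{E}\big[ (Y - \varphi(U))^2 \big] \ge \frac{1}{s + J(Y)}, \tag{3.364}
\]

with equality if and only if $Y$ has a standard normal distribution, and $\varphi(U)$ is the MMSE estimator of $Y$ given $U$.

The strategy of the proof is, actually, very simple. Define two random variables

\begin{align*}
\Delta(U, Y) &\triangleq \varphi(U) - Y, \\
\Upsilon(U, Y) &\triangleq \frac{d}{dy} \ln\big[ p_{U|Y}(U|y)\, p_Y(y) \big] \bigg|_{y = Y} \\
&= \frac{d}{dy} \ln\big[ \gamma(U - \sqrt{s}\, y)\, p_Y(y) \big] \bigg|_{y = Y} \\
&= \sqrt{s}\, (U - \sqrt{s}\, Y) + \rho_Y(Y) \\
&= \sqrt{s}\, Z + \rho_Y(Y)
\end{align*}

where $\rho_Y(y) \triangleq \frac{d}{dy} \ln p_Y(y)$ for $y \in \mathbb{R}$ is the score function. We will show below that $\mathbb{E}[\Delta(U, Y) \Upsilon(U, Y)] = 1$. Then, applying the Cauchy-Schwarz inequality, we obtain

\begin{align*}
1 &= \big| \mathbb{E}[\Delta(U, Y) \Upsilon(U, Y)] \big|^2 \\
&\le \mathbb{E}[\Delta^2(U, Y)] \cdot \mathbb{E}[\Upsilon^2(U, Y)] \\
&= \mathbb{E}\big[ (\varphi(U) - Y)^2 \big] \cdot \mathbb{E}\big[ (\sqrt{s}\, Z + \rho_Y(Y))^2 \big] \\
&= \mathbb{E}\big[ (\varphi(U) - Y)^2 \big] \cdot (s + J(Y)).
\end{align*}

Upon rearranging, we obtain (3.364). Now, the fact that $J(Y) < \infty$ implies that the density $p_Y$ is bounded (see [130, Lemma A.1]). Using this together with the rapid decay of the Gaussian density $\gamma$ at infinity, we have

\[
\int_{-\infty}^{\infty} \frac{d}{dy} \big[ p_{U|Y}(u|y)\, p_Y(y) \big]\, dy = \gamma(u - \sqrt{s}\, y)\, p_Y(y) \Big|_{-\infty}^{\infty} = 0. \tag{3.365}
\]

Integration by parts gives

\begin{align}
\int_{-\infty}^{\infty} y\, \frac{d}{dy} \big[ p_{U|Y}(u|y)\, p_Y(y) \big]\, dy &= y\, \gamma(u - \sqrt{s}\, y)\, p_Y(y) \Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} p_{U|Y}(u|y)\, p_Y(y)\, dy \notag \\
&= -\int_{-\infty}^{\infty} p_{U|Y}(u|y)\, p_Y(y)\, dy \notag \\
&= -p_U(u). \tag{3.366}
\end{align}

Using (3.365) and (3.366), we have

\begin{align*}
\mathbb{E}[\Delta(U, Y) \Upsilon(U, Y)]
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (\varphi(u) - y)\, \frac{d}{dy} \ln\big[ p_{U|Y}(u|y)\, p_Y(y) \big]\, p_{U|Y}(u|y)\, p_Y(y)\, du\, dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (\varphi(u) - y)\, \frac{d}{dy} \big[ p_{U|Y}(u|y)\, p_Y(y) \big]\, du\, dy \\
&= \int_{-\infty}^{\infty} \varphi(u) \underbrace{\left( \int_{-\infty}^{\infty} \frac{d}{dy} \big[ p_{U|Y}(u|y)\, p_Y(y) \big]\, dy \right)}_{=\,0} du - \int_{-\infty}^{\infty} \underbrace{\left( \int_{-\infty}^{\infty} y\, \frac{d}{dy} \big[ p_{U|Y}(u|y)\, p_Y(y) \big]\, dy \right)}_{=\,-p_U(u)} du \\
&= \int_{-\infty}^{\infty} p_U(u)\, du \\
&= 1,
\end{align*}

as was claimed. It remains to establish the necessary and sufficient condition for equality in (3.364). The Cauchy-Schwarz inequality for the product of $\Delta(U, Y)$ and $\Upsilon(U, Y)$ holds with equality if and only if $\Delta(U, Y) = c\, \Upsilon(U, Y)$ for some constant $c \in \mathbb{R}$, almost surely. This is equivalent to

\begin{align*}
\varphi(U) &= Y + c \sqrt{s}\, (U - \sqrt{s}\, Y) + c \rho_Y(Y) \\
&= c \sqrt{s}\, U + (1 - cs) Y + c \rho_Y(Y)
\end{align*}

for some $c \in \mathbb{R}$. In fact, $c$ must be nonzero, for otherwise we will have $\varphi(U) = Y$, which is not a valid estimator. But then it must be the case that $(1 - cs) Y + c \rho_Y(Y)$ is independent of $Y$, i.e., there exists some other constant $c' \in \mathbb{R}$, such that

\[
\rho_Y(y) \triangleq \frac{p'_Y(y)}{p_Y(y)} = \frac{c'}{c} + \left( s - \frac{1}{c} \right) y.
\]

In other words, the score must be an affine function of $y$, which is the case if and only if $Y$ is a Gaussian random variable.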
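
For a standard normal $Y$ we have $J(Y) = 1$, the MMSE estimator is the linear map $u \mapsto \sqrt{s}\, u / (1 + s)$, and its mean squared error equals $1/(1 + s)$, so (3.364) is met with equality. The following Monte Carlo sketch illustrates this case only; the sample size, the seed, and the function names are arbitrary choices made for the illustration.

```python
import math
import random


def van_trees_bound(s: float, fisher_info: float) -> float:
    """Right-hand side of (3.364): 1 / (s + J(Y))."""
    return 1.0 / (s + fisher_info)


def mmse_estimator(s: float):
    """E[Y | U = u] = sqrt(s) * u / (1 + s) when Y and Z are independent standard normals."""
    gain = math.sqrt(s) / (1.0 + s)
    return lambda u: gain * u


def simulate_mse(s: float, estimator, num_samples: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[(Y - estimator(U))^2] with U = sqrt(s) * Y + Z."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        y = rng.gauss(0.0, 1.0)
        u = math.sqrt(s) * y + rng.gauss(0.0, 1.0)
        total += (y - estimator(u)) ** 2
    return total / num_samples
```

With `s = 2.0`, `simulate_mse(2.0, mmse_estimator(2.0))` should be close to `van_trees_bound(2.0, 1.0)`, whereas the unbiased but suboptimal estimator `lambda u: u / math.sqrt(2.0)` gives a mean squared error near $1/s$, strictly above the bound.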



3.B Details on the Ornstein-Uhlenbeck semigroup

In this appendix, we will prove the formulas (3.84) and (3.85) pertaining to the Ornstein-Uhlenbeck semigroup. We start with (3.84). Recalling that

\[
h_t(x) = K_t h(x) = \mathbb{E}\Big[ h\big( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \big) \Big],
\]

we have

\begin{align*}
\dot{h}_t(x) &= \frac{d}{dt} \mathbb{E}\Big[ h\big( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \big) \Big] \\
&= -e^{-t} x\, \mathbb{E}\Big[ h'\big( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \big) \Big] + \frac{e^{-2t}}{\sqrt{1 - e^{-2t}}} \cdot \mathbb{E}\Big[ Z h'\big( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \big) \Big].
\end{align*}

For any sufficiently smooth function $h$ and any $m, \sigma \in \mathbb{R}$,

\[
\mathbb{E}[Z h'(m + \sigma Z)] = \sigma\, \mathbb{E}[h''(m + \sigma Z)]
\]

(which is proved straightforwardly using integration by parts, provided that $\lim_{x \to \pm\infty} e^{-x^2/2} h'(m + \sigma x) = 0$). Using this equality, we can write

\[
\mathbb{E}\Big[ Z h'\big( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \big) \Big] = \sqrt{1 - e^{-2t}}\; \mathbb{E}\Big[ h''\big( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \big) \Big].
\]

Therefore,

\[
\dot{h}_t(x) = -e^{-t} x \cdot K_t h'(x) + e^{-2t} K_t h''(x). \tag{3.367}
\]

On the other hand,

\begin{align}
\mathcal{L} h_t(x) &= h''_t(x) - x h'_t(x) \notag \\
&= e^{-2t}\, \mathbb{E}\Big[ h''\big( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \big) \Big] - x e^{-t}\, \mathbb{E}\Big[ h'\big( e^{-t} x + \sqrt{1 - e^{-2t}}\, Z \big) \Big] \notag \\
&= e^{-2t} K_t h''(x) - e^{-t} x K_t h'(x). \tag{3.368}
\end{align}

Comparing (3.367) and (3.368), we get (3.84).

The proof of the integration-by-parts formula (3.85) is more subtle, and relies on the fact that the Ornstein-Uhlenbeck process $\{Y_t\}_{t \ge 0}$ with $Y_0 \sim G$ is stationary and reversible in the sense that, for any two $t, t' \ge 0$, $(Y_t, Y_{t'}) \stackrel{d}{=} (Y_{t'}, Y_t)$. To see this, let

\[
p^{(t)}(y|x) \triangleq \frac{1}{\sqrt{2\pi (1 - e^{-2t})}} \exp\left( -\frac{(y - e^{-t} x)^2}{2 (1 - e^{-2t})} \right)
\]

be the transition density of the OU($t$) channel. Then it is not hard to establish that

\[
p^{(t)}(y|x)\, \gamma(x) = p^{(t)}(x|y)\, \gamma(y), \qquad \forall\, x, y \in \mathbb{R}
\]

(recall that $\gamma$ denotes the standard Gaussian pdf). For $Z \sim G$ and any two smooth functions $g, h$, this implies that

\begin{align*}
\mathbb{E}[g(Z) K_t h(Z)] &= \mathbb{E}[g(Y_0) K_t h(Y_0)] \\
&= \mathbb{E}\big[ g(Y_0)\, \mathbb{E}[h(Y_t) | Y_0] \big] \\
&= \mathbb{E}[g(Y_0) h(Y_t)] \\
&= \mathbb{E}[g(Y_t) h(Y_0)] \\
&= \mathbb{E}[K_t g(Y_0) h(Y_0)] \\
&= \mathbb{E}[K_t g(Z) h(Z)],
\end{align*}

where we have used (3.80) and the reversibility property of the Ornstein-Uhlenbeck process. Taking the derivative of both sides w.r.t. $t$, we conclude that

\[
\mathbb{E}[g(Z) \mathcal{L} h(Z)] = \mathbb{E}[\mathcal{L} g(Z) h(Z)]. \tag{3.369}
\]

In particular, since $\mathcal{L} 1 = 0$ (where on the left-hand side $1$ denotes the constant function $x \mapsto 1$), we have

\[
\mathbb{E}[\mathcal{L} g(Z)] = \mathbb{E}[1 \cdot \mathcal{L} g(Z)] = \mathbb{E}[g(Z) \mathcal{L} 1] = 0 \tag{3.370}
\]

for all smooth $g$.

Remark 56. If we consider the Hilbert space $L^2(G)$ of all functions $g : \mathbb{R} \to \mathbb{R}$ such that $\mathbb{E}[g^2(Z)] < \infty$ with $Z \sim G$, then (3.369) expresses the fact that $\mathcal{L}$ is a self-adjoint linear operator on this space. Moreover, (3.370) shows that the constant functions are in the kernel of $\mathcal{L}$ (the closed linear subspace of $L^2(G)$ consisting of all $g$ with $\mathcal{L} g = 0$).

We are now ready to prove (3.85). To that end, let us first define the operator $\Gamma$ on pairs of functions $g, h$ by

\[
\Gamma(g, h) \triangleq \frac{1}{2} \big[ \mathcal{L}(gh) - g \mathcal{L} h - h \mathcal{L} g \big]. \tag{3.371}
\]

Remark 57. This operator was introduced into the study of Markov processes by Paul Meyer under the name "carré du champ" (French for "square of the field"). In the general theory, $\mathcal{L}$ can be any linear operator that serves as an infinitesimal generator of a Markov semigroup. Intuitively, $\Gamma$ measures how far a given $\mathcal{L}$ is from being a derivation, where we say that an operator $\mathcal{L}$ acting on a function space is a derivation (or that it satisfies the Leibniz rule) if, for any $g, h$ in its domain,

\[
\mathcal{L}(gh) = g \mathcal{L} h + h \mathcal{L} g.
\]

An example of a derivation is the first-order linear differential operator $\mathcal{L} g = g'$, in which case the Leibniz rule is simply the product rule of differential calculus.

Now, for our specific definition of $\mathcal{L}$, we have

\begin{align}
\Gamma(g, h)(x) &= \frac{1}{2} \Big[ (gh)''(x) - x (gh)'(x) - g(x) \big( h''(x) - x h'(x) \big) - h(x) \big( g''(x) - x g'(x) \big) \Big] \notag \\
&= \frac{1}{2} \Big[ g''(x) h(x) + 2 g'(x) h'(x) + g(x) h''(x) \notag \\
&\qquad - x g'(x) h(x) - x g(x) h'(x) - g(x) h''(x) + x g(x) h'(x) - g''(x) h(x) + x g'(x) h(x) \Big] \notag \\
&= g'(x) h'(x), \tag{3.372}
\end{align}

or, more succinctly, $\Gamma(g, h) = g' h'$. Therefore,

\begin{align}
\mathbb{E}[g(Z) \mathcal{L} h(Z)] &= \frac{1}{2} \Big\{ \mathbb{E}[g(Z) \mathcal{L} h(Z)] + \mathbb{E}[h(Z) \mathcal{L} g(Z)] \Big\} \tag{3.373} \\
&= \frac{1}{2} \mathbb{E}[\mathcal{L}(gh)(Z)] - \mathbb{E}[\Gamma(g, h)(Z)] \tag{3.374} \\
&= -\mathbb{E}[g'(Z) h'(Z)], \tag{3.375}
\end{align}

where (3.373) uses (3.369), (3.374) uses the definition (3.371) of $\Gamma$, and (3.375) uses (3.372) together with (3.370). This proves (3.85).
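
For polynomial $g$ and $h$ every quantity in (3.369) through (3.375) can be computed exactly, since $\mathbb{E}[Z^k] = (k-1)!!$ for even $k$ and $0$ for odd $k$. The sketch below represents a polynomial as a list of coefficients in increasing degree and uses exact rational arithmetic; this representation and the helper names are assumptions made for the illustration only.

```python
from fractions import Fraction


def poly_add(a, b):
    """Coefficient-wise sum of two polynomials (lists, lowest degree first)."""
    n = max(len(a), len(b))
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]


def poly_scale(a, c):
    return [c * x for x in a]


def poly_mul(a, b):
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def poly_diff(a):
    """Derivative; the zero polynomial is returned as [0]."""
    return [k * a[k] for k in range(1, len(a))] or [Fraction(0)]


def ou_generator(h):
    """(L h)(x) = h''(x) - x h'(x)."""
    x_times_first = [Fraction(0)] + poly_diff(h)
    return poly_add(poly_diff(poly_diff(h)), poly_scale(x_times_first, -1))


def gaussian_mean(a):
    """E[a(Z)] for Z ~ N(0, 1), using E[Z^k] = (k-1)!! for even k and 0 for odd k."""
    total, moment = Fraction(0), Fraction(1)
    for k in range(0, len(a), 2):
        if k > 0:
            moment *= (k - 1)
        total += a[k] * moment
    return total


def carre_du_champ(g, h):
    """Gamma(g, h) = (1/2) [L(gh) - g L h - h L g], as in (3.371)."""
    cross = poly_add(poly_mul(g, ou_generator(h)), poly_mul(h, ou_generator(g)))
    return poly_scale(poly_add(ou_generator(poly_mul(g, h)), poly_scale(cross, -1)), Fraction(1, 2))


def poly_equal(a, b):
    return all(c == 0 for c in poly_add(a, poly_scale(b, -1)))
```

For instance, with `g = [1, 2, 0, 3]` (that is, $1 + 2x + 3x^3$) and `h = [0, 1, -1, 0, 2]`, `carre_du_champ(g, h)` coincides with the product of the derivatives, and `gaussian_mean(poly_mul(g, ou_generator(h)))` equals minus the Gaussian mean of $g'h'$, in line with (3.372) and (3.375).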



3.C Fano's inequality for list decoding 

The following generalization of Fano's inequality has been used in the proof of Theorem 45: Let $\mathcal{X}$ and $\mathcal{Y}$ be finite sets, and let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a pair of jointly distributed random variables. Consider an arbitrary mapping $L : \mathcal{Y} \to 2^{\mathcal{X}}$ which maps any $y \in \mathcal{Y}$ to a set $L(y) \subseteq \mathcal{X}$. Let $P_e = \mathbb{P}(X \notin L(Y))$. Then

\[
H(X|Y) \le h(P_e) + (1 - P_e)\, \mathbb{E}\big[ \ln |L(Y)| \big] + P_e \ln |\mathcal{X}| \tag{3.376}
\]

(see, e.g., [171] or [180, Lemma 1]).

To prove (3.376), define the indicator random variable $E \triangleq 1_{\{X \notin L(Y)\}}$. Then we can expand the conditional entropy $H(E, X | Y)$ in two ways as

\begin{align}
H(E, X | Y) &= H(E|Y) + H(X | E, Y) \tag{3.377a} \\
&= H(X|Y) + H(E | X, Y). \tag{3.377b}
\end{align}

Since $X$ and $Y$ uniquely determine $E$ (for the given $L$), the quantity on the right-hand side of (3.377b) is equal to $H(X|Y)$. On the other hand, we can upper-bound the right-hand side of (3.377a) as

\begin{align*}
H(E|Y) + H(X | E, Y) &\le H(E) + H(X | E, Y) \\
&= h(P_e) + \mathbb{P}(E = 0) H(X | E = 0, Y) + \mathbb{P}(E = 1) H(X | E = 1, Y) \\
&\le h(P_e) + (1 - P_e)\, \mathbb{E}\big[ \ln |L(Y)| \big] + P_e \ln |\mathcal{X}|,
\end{align*}

where the last line uses the fact that when $E = 0$ (resp., $E = 1$), the uncertainty about $X$ is at most $\mathbb{E}[\ln |L(Y)|]$ (respectively, $\ln |\mathcal{X}|$). More precisely,

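\begin{align*}
H(X | E = 0, Y) &= -\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y, E = 0) \sum_{x \in \mathcal{X}} \mathbb{P}(X = x | Y = y, E = 0) \ln \mathbb{P}(X = x | Y = y, E = 0) \\
&= -\sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y, E = 0) \sum_{x \in \mathcal{X}} \mathbb{P}(X = x | Y = y) \ln \mathbb{P}(X = x | Y = y) \\
&\le \sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y, E = 0) \ln |L(y)| \\
&\le \sum_{y \in \mathcal{Y}} \mathbb{P}(Y = y) \ln |L(y)| \\
&= \mathbb{E}\big[ \ln |L(Y)| \big].
\end{align*}

In particular, when $L$ is such that $|L(Y)| \le N$ a.s., we can apply Jensen's inequality to the second term on the right-hand side of (3.376) to get

\[
H(X|Y) \le h(P_e) + (1 - P_e) \ln N + P_e \ln |\mathcal{X}|.
\]

This is precisely the inequality we used to derive the bound (3.280) in the proof of Theorem 45.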
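
The bounded-list form $H(X|Y) \le h(P_e) + (1 - P_e) \ln N + P_e \ln |\mathcal{X}|$, which is the version applied in the proofs, can be probed on small random instances. The sketch below draws a random joint distribution and random lists of a fixed size $N$; the instance generator and all names are illustrative assumptions, and the check is a sanity test rather than a proof.

```python
import math
import random


def binary_entropy(p: float) -> float:
    """h(p) in nats, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def conditional_entropy(joint) -> float:
    """H(X|Y) in nats for joint[x][y] = P(X = x, Y = y)."""
    nx, ny = len(joint), len(joint[0])
    total = 0.0
    for y in range(ny):
        py = sum(joint[x][y] for x in range(nx))
        for x in range(nx):
            if joint[x][y] > 0:
                total -= joint[x][y] * math.log(joint[x][y] / py)
    return total


def list_fano_bound(joint, lists, list_size: int) -> float:
    """h(P_e) + (1 - P_e) ln N + P_e ln|X| with P_e = P(X not in L(Y)), where |L(y)| <= N."""
    nx, ny = len(joint), len(joint[0])
    p_err = sum(joint[x][y] for y in range(ny) for x in range(nx) if x not in lists[y])
    return (binary_entropy(p_err)
            + (1.0 - p_err) * math.log(list_size)
            + p_err * math.log(nx))


def random_instance(nx: int, ny: int, list_size: int, rng):
    """Random joint pmf on an nx-by-ny grid and one random list of the given size per y."""
    weights = [[rng.random() for _ in range(ny)] for _ in range(nx)]
    total = sum(map(sum, weights))
    joint = [[w / total for w in row] for row in weights]
    lists = [set(rng.sample(range(nx), list_size)) for _ in range(ny)]
    return joint, lists
```

On each instance returned by `random_instance`, `conditional_entropy(joint)` should not exceed `list_fano_bound(joint, lists, list_size)`.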

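3.D Details for the derivation of (3.306)

Let $X^n \sim P_{X^n}$ and $Y^n \in \mathcal{Y}^n$ be the input and output sequences of a DMC with transition matrix $T : \mathcal{X} \to \mathcal{Y}$, where the DMC is used without feedback. In other words, $(X^n, Y^n) \in \mathcal{X}^n \times \mathcal{Y}^n$ is a random variable with $X^n \sim P_{X^n}$ and

\[
P_{Y^n|X^n}(y^n | x^n) = \prod_{i=1}^{n} P_{Y|X}(y_i | x_i), \qquad \forall\, y^n \in \mathcal{Y}^n,\ \forall\, x^n \in \mathcal{X}^n \text{ s.t. } P_{X^n}(x^n) > 0.
\]

Because the channel is memoryless and there is no feedback, the $i$th output symbol $Y_i \in \mathcal{Y}$ depends only on the $i$th input symbol $X_i \in \mathcal{X}$ and not on the rest of the input symbols $\overline{X}^i$. Consequently, $\overline{Y}^i \to X_i \to Y_i$ is a Markov chain for every $i = 1, \ldots, n$, so we can write

\begin{align}
P_{Y_i | \overline{Y}^i}(y | \overline{y}^i) &= \sum_{x \in \mathcal{X}} P_{Y_i | X_i}(y | x)\, P_{X_i | \overline{Y}^i}(x | \overline{y}^i) \tag{3.378} \\
&= \sum_{x \in \mathcal{X}} P_{Y|X}(y | x)\, P_{X_i | \overline{Y}^i}(x | \overline{y}^i) \tag{3.379}
\end{align}

for all $y \in \mathcal{Y}$ and all $\overline{y}^i \in \mathcal{Y}^{n-1}$ such that $P_{\overline{Y}^i}(\overline{y}^i) > 0$. Therefore, for any two $y, y' \in \mathcal{Y}$ we have

\begin{align*}
\ln \frac{P_{Y_i | \overline{Y}^i}(y | \overline{y}^i)}{P_{Y_i | \overline{Y}^i}(y' | \overline{y}^i)}
&= \ln \sum_{x \in \mathcal{X}} P_{Y|X}(y | x)\, P_{X_i | \overline{Y}^i}(x | \overline{y}^i) - \ln \sum_{x \in \mathcal{X}} P_{Y|X}(y' | x)\, P_{X_i | \overline{Y}^i}(x | \overline{y}^i) \\
&\le \max_{x \in \mathcal{X}} \ln P_{Y|X}(y | x) - \min_{x \in \mathcal{X}} \ln P_{Y|X}(y' | x).
\end{align*}

Interchanging the roles of $y$ and $y'$, we get

\[
\left| \ln \frac{P_{Y_i | \overline{Y}^i}(y | \overline{y}^i)}{P_{Y_i | \overline{Y}^i}(y' | \overline{y}^i)} \right| \le \max_{x, x' \in \mathcal{X}} \left| \ln \frac{P_{Y|X}(y | x)}{P_{Y|X}(y' | x')} \right|.
\]

This implies, in turn, that

\[
\left| \ln \frac{P_{Y_i | \overline{Y}^i}(y | \overline{y}^i)}{P_{Y_i | \overline{Y}^i}(y' | \overline{y}^i)} \right| \le \max_{x, x' \in \mathcal{X}} \max_{y, y' \in \mathcal{Y}} \left| \ln \frac{P_{Y|X}(y | x)}{P_{Y|X}(y' | x')} \right| = \frac{1}{2} c(T)
\]

for all $y, y' \in \mathcal{Y}$.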